# Hypernym Relationship Extraction

In this example, we use NLTK and Hearst patterns to extract hypernym relationships.
- First, set up a Python environment.
- Install NLTK: ``pip install nltk``
- Download the NLTK data distribution with the NLTK downloader: ``nltk.download()``. If the download fails, fetch the data manually from https://github.com/nltk/nltk_data/tree/gh-pages or https://pan.baidu.com/s/1wONWpaa86_wnsIksKda8eQ (code: tfon)
- Unzip the downloaded file into the folder returned by ``nltk.data.find(".")``
- Unzip each zip file in the ten folders: *chunkers, corpora, grammars, help, misc, models, sentiment, stemmers, taggers, tokenizers*

## Hyponym Extraction using Hearst Pattern
Hyponym extraction follows four steps:
- Noun phrase chunking or named entity chunking. Any NP chunking or named entity recognition technique can be used.
- Prepare the chunked sentences. Traverse the chunked result; if a chunk's label is ``NP``, merge all the words in the chunk and add the prefix ``NP_`` (for subsequent processing).
- Refine the chunking. Two or more NPs next to each other are merged into a single NP, e.g., *"NP_foo NP_bar blah blah"* becomes *"NP_foo_bar blah blah"*.
- Find the hypernym and hyponym pairs in the refined, prepared sentence.

In [1]:
import nltk
import re
from nltk import pos_tag, word_tokenize, Tree
from nltk.stem import WordNetLemmatizer 

Regular expression practice: the cell below shows a regex for one Hearst pattern, ``NP such as {NP,}* {(or | and)} NP`` (see https://docs.python.org/3/library/re.html for the ``re`` module).

In [2]:
regex = r"(NP_\w+ (, )?such as (NP_\w+ ?(, )?(and |or )?)+)"
test_str = "NP_1 such as NP_2 , NP_3 and NP_4 "
matches = re.search(regex, test_str)
if matches:
    # Match.group([group1, ...]) Returns one or more subgroups of the match. 
    # If there is a single argument, the result is a single string;
    # if there are multiple arguments, the result is a tuple with one item per argument. 
    # Without arguments, group1 defaults to zero (the whole match is returned).
    print(matches.group(0))

NP_1 such as NP_2 , NP_3 and NP_4 
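
One detail of this regex is worth checking: a capturing group inside a ``+`` repetition keeps only its last iteration, so the individual hyponyms cannot be read from the match groups. The sketch below uses only the standard ``re`` module and the same test string, and pulls every ``NP_`` token out of the whole match instead. The variable names ``last_iteration`` and ``np_tokens`` are illustrative.

```python
import re

regex = r"(NP_\w+ (, )?such as (NP_\w+ ?(, )?(and |or )?)+)"
test_str = "NP_1 such as NP_2 , NP_3 and NP_4 "

match = re.search(regex, test_str)

# Group 3 sits inside a "+" repetition, so it only remembers its last iteration.
last_iteration = match.group(3)

# Collect every NP token from the full match text instead.
np_tokens = re.findall(r"NP_\w+", match.group(0))

print(repr(last_iteration))
print(np_tokens)
```

This is why the extraction step later works on ``group(0)`` and filters for tokens that start with ``NP_``.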


### Step1: Chunking Sentence
- Note that the result is not a list of chunked NPs but the chunk tree structure.

In [3]:
from nltk import ne_chunk
def np_chunking(sentence):
    # Tokenize, POS-tag, then run NLTK's named entity chunker; the result is an nltk.Tree
    result = ne_chunk(pos_tag(word_tokenize(sentence)))
    return result

print(np_chunking("""I like to listen to music from musical genres,such as blues,rock and jazz."""))

(S
  I/PRP
  like/VBP
  to/TO
  listen/VB
  to/TO
  music/NN
  from/IN
  musical/JJ
  genres/NNS
  ,/,
  such/JJ
  as/IN
  blues/NNS
  ,/,
  rock/NN
  and/CC
  jazz/NN
  ./.)


### Step2: Prepare the chunked result for subsequent Hearst pattern matching
- Traverse the chunked result, if the label is ``NP``, then merge all the words in this chunk and add a prefix ``NP_``
- All the tokens are separated with a white space (``" "``) 
- Remember to lemmatize words, using ``WordNetLemmatizer`` (``from nltk.stem import WordNetLemmatizer``)
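
One point the list above leaves implicit: cue words such as "such" and "other" can be tagged ``JJ`` (see ``such/JJ`` in the Step 1 output), so an adjective-plus-noun grammar may absorb them into the NP and hide them from the Hearst regexes. The sketch below illustrates splitting them back out. It uses hand-written chunks in place of an NLTK tree and skips lemmatization; ``to_prepared_string``, ``CUE_WORDS`` and the input format are assumptions made for illustration only.

```python
CUE_WORDS = {"such", "other"}

def to_prepared_string(chunks):
    """chunks: list of ("NP", [words]) for noun phrases, (None, [word]) otherwise."""
    terms = []
    for label, words in chunks:
        if label != "NP":
            terms.extend(words)
            continue
        # Keep a leading cue word as its own token so the patterns can see it.
        if words[0] in CUE_WORDS and len(words) > 1:
            terms.append(words[0])
            words = words[1:]
        terms.append("NP_" + "_".join(words))
    return " ".join(terms)

# Hand-written stand-in for a chunked sentence (not produced by NLTK).
chunks = [
    ("NP", ["works"]),
    (None, ["by"]),
    ("NP", ["such", "authors"]),
    (None, ["as"]),
    ("NP", ["Herrick"]),
    (None, [","]),
    ("NP", ["Goldsmith"]),
    (None, [","]),
    (None, ["and"]),
    ("NP", ["Shakespeare"]),
]
prepared = to_prepared_string(chunks)
print(prepared)
```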

In [4]:
# Prepare the chunked sentence: merge the words of each NP and add the prefix NP_
def prepare_chunks(chunks):
    # NP chunks become "NP_" + tokens joined with "_"; other tokens are kept as they are
    terms = []
    lemmatizer = WordNetLemmatizer()
    # Chunk grammar for the NP label: optional adjectives followed by one or more nouns
    grammar = "NP: {<JJ>*<NN.*>+}\n{<NN.*>+}\n"
    cp = nltk.RegexpParser(grammar)
    chunks = cp.parse(chunks)
    # chunks.draw()
    for chunk in chunks:
        # Subtrees carry a label; plain (word, tag) tuples do not
        label = chunk.label() if isinstance(chunk, Tree) else None
        if label in ('NP', 'GPE', 'PERSON'):
            words = [word for (word, tag) in chunk]
            # Keep a leading "such"/"other" as a separate token so the Hearst patterns can match it
            if words[0] in ('such', 'other'):
                terms.append(words[0])
                words = words[1:]
            np = "NP_" + "_".join(lemmatizer.lemmatize(word) for word in words)
            terms.append(np)
        elif isinstance(chunk, Tree):
            # Subtrees with any other label are kept as plain words
            terms.extend(word for (word, tag) in chunk.leaves())
        else:
            terms.append(chunk[0])
    return ' '.join(terms)

In [5]:
# raw_text = "I like to listen to music from musical genres,such as blues,rock and jazz."
# raw_text = "Agar is a substance prepared from a mixture of red algae, such as Gelidium,for laboratory or industrial use."
raw_text = "... works by such authors as Herrick, Goldsmith, and Shakespeare."
chunk_res = np_chunking(raw_text)
print(prepare_chunks(chunk_res))

... NP_work by such NP_author as NP_Herrick , NP_Goldsmith , and NP_Shakespeare .


### Step3: Refinement chunking
If two or more NPs are next to each other, merge them into a single NP. E.g., ``NP_foo NP_bar blah blah`` becomes ``NP_foo_bar blah blah``

In [6]:
def merge_NP(prepared_chunks):
    # Match a whole run of adjacent NP tokens (two or more) and join them with "_"
    sentence = re.sub(r"NP_\w+(?: NP_\w+)+", lambda m: m.group(0).replace(" NP_", "_"), prepared_chunks)
    return sentence

In [7]:
merge_NP("NP_foo NP_bar blah blah")

'NP_foo_bar blah blah'
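
A quick check on longer runs can be useful here, since the step is meant to cover "two or more" NPs. The sketch below uses a stand-alone helper, ``merge_np_runs`` (a name introduced here so the snippet runs on its own), which matches an entire run of adjacent ``NP_`` tokens with a non-capturing group.

```python
import re

def merge_np_runs(prepared_chunks):
    # One NP token followed by one or more " NP_..." tokens forms a run.
    return re.sub(
        r"NP_\w+(?: NP_\w+)+",
        lambda m: m.group(0).replace(" NP_", "_"),
        prepared_chunks,
    )

print(merge_np_runs("NP_foo NP_bar blah blah"))
print(merge_np_runs("NP_a NP_b NP_c likes NP_d NP_e"))
```

A run of three tokens collapses into one NP, and separate runs in the same sentence are merged independently.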

### Step4: Find the hypernym and hyponyms on processed chunked results
- Define Hearst patterns. Besides the regex, we also need to specify whether the hypernym is in the first part or the second part in the pattern.
  - For example, in the pattern ``NP1 such as NP2 AND NP3``, the hypernym is the first part of the pattern; in the pattern ``NP1 , NP2 and other NP3``, the hypernym is the last part of the pattern. 
- After regex matching, find all the NPs and extract the hypernym and hyponym pairs based on the ``first`` or ``last`` attribute.
- Clean the NPs by removing the prefix ``NP_`` and replacing each ``_`` with a space.

In [8]:
# Given the prepared text, return the (hyponym, hypernym) pairs
def hyponym_extract(prepared_text, hearst_patterns):
    # Only the first match of each pattern is used (re.search)
    pairs = []
    for (pattern, parser) in hearst_patterns:
        matches = re.search(pattern, prepared_text)
        if matches:
            match_str = matches.group(0)
            # find all NP_xx and save to a list.
            nps = [i for i in match_str.split(" ") if i.startswith("NP_")]
            if parser == "first":
                hypernym = nps[0]
                hyponyms = nps[1:]
            else:
                hypernym = nps[-1]
                hyponyms = nps[:-1]
            for item in hyponyms:
                pairs.append((item, hypernym))
    return pairs

# Each entry is (regex, position of the hypernym among the matched NPs)
hearst_patterns = [(r"(NP_\w+ (, )?such as (NP_\w+ ?(, )?(and |or )?)+)", "first"),
                    (r"((NP_\w+ ?(, )?)+(and |or )?other NP_\w+)", "last"),
                    (r"(such NP_\w+ (, )?as (NP_\w+ ?(, )?(and |or )?)+)", "first"),
                    (r"(NP_\w+ ?(, )?including (NP_\w+ ?(, )?(and |or )?)+)", "first"),
                    (r"(NP_\w+ ?(, )?especially (NP_\w+ ?(, )?(and |or )?)+)", "first")]

print(hyponym_extract(prepare_chunks(np_chunking("I like to listen to music from musical genres, such as blues,rock and jazz.")), hearst_patterns))
print(hyponym_extract(prepare_chunks(np_chunking("He likes to play basketball,football and other sports.")), hearst_patterns))

[('NP_blue', 'NP_musical_genre'), ('NP_rock', 'NP_musical_genre'), ('NP_jazz', 'NP_musical_genre')]
[('NP_basketball', 'NP_sport'), ('NP_football', 'NP_sport')]
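
``re.search`` returns only the first match, so a sentence that contains the same pattern twice yields pairs for the first occurrence only. A possible variant, sketched here under the assumption that every occurrence should be collected, iterates with ``re.finditer``. ``extract_all_pairs`` is a hypothetical name, and the input is a hand-written prepared string rather than NLTK output.

```python
import re

def extract_all_pairs(prepared_text, hearst_patterns):
    pairs = []
    for pattern, position in hearst_patterns:
        # finditer yields every non-overlapping match, not just the first one.
        for match in re.finditer(pattern, prepared_text):
            nps = re.findall(r"NP_\w+", match.group(0))
            if position == "first":
                hypernym, hyponyms = nps[0], nps[1:]
            else:
                hypernym, hyponyms = nps[-1], nps[:-1]
            pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs

patterns = [(r"(NP_\w+ (, )?such as (NP_\w+ ?(, )?(and |or )?)+)", "first")]
text = "NP_fruit such as NP_apple and NP_pear ; NP_tool such as NP_hammer ."
print(extract_all_pairs(text, patterns))
```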


In [9]:
def find_hyponyms(sentence, hearst_patterns):
    # Chain the steps: chunk, prepare, merge adjacent NPs, then match the Hearst patterns
    prepared = merge_NP(prepare_chunks(np_chunking(sentence)))
    res = hyponym_extract(prepared, hearst_patterns)
    return res

print(find_hyponyms("""I like to listen to music from musical genres,such as blues,rock and jazz.""", hearst_patterns))
print(find_hyponyms("""He likes to play basketball,football and other sports.""",hearst_patterns))

[('NP_blue', 'NP_musical_genre'), ('NP_rock', 'NP_musical_genre'), ('NP_jazz', 'NP_musical_genre')]
[('NP_basketball', 'NP_sport'), ('NP_football', 'NP_sport')]


In [10]:
def clean_np(term):
    return term.replace("NP_", "").replace("_", " ")
clean_np('NP_football')

'football'

## Complete Program for Hypernym extraction using Hearst Pattern

In [11]:
# Merge everything to get the final extractor
class HearstPatterns(object):
    # Extractor class that bundles the functions defined above
    def __init__(self):
        # Each entry is (regex, position of the hypernym among the matched NPs)
        self.hearst_patterns = [(r"(NP_\w+ (, )?such as (NP_\w+ ?(, )?(and |or )?)+)", "first"),
                        (r"((NP_\w+ ?(, )?)+(and |or )?other NP_\w+)", "last"),
                        (r"(such NP_\w+ (, )?as (NP_\w+ ?(, )?(and |or )?)+)", "first"),
                        (r"(NP_\w+ ?(, )?including (NP_\w+ ?(, )?(and |or )?)+)", "first"),
                        (r"(NP_\w+ ?(, )?especially (NP_\w+ ?(, )?(and |or )?)+)", "first")]
        self.lemmatizer = WordNetLemmatizer()

    # Step 1: chunk the sentence; the result is an nltk.Tree
    def np_chunking(self, sentence):
        result = ne_chunk(pos_tag(word_tokenize(sentence)))
        return result

    # Step 2: prepare the chunked sentence by merging words and adding the prefix NP_
    def prepare_chunks(self, chunks):
        # NP chunks become "NP_" + tokens joined with "_"; other tokens are kept as they are
        terms = []
        # Chunk grammar for the NP label: optional adjectives followed by one or more nouns
        grammar = "NP: {<JJ>*<NN.*>+}\n{<NN.*>+}"
        cp = nltk.RegexpParser(grammar)
        chunks = cp.parse(chunks)
        for chunk in chunks:
            # Subtrees carry a label; plain (word, tag) tuples do not
            label = chunk.label() if isinstance(chunk, Tree) else None
            if label in ('NP', 'GPE', 'PERSON'):
                words = [word for (word, tag) in chunk]
                # Keep a leading "such"/"other" as a separate token for the Hearst patterns
                if words[0] in ('such', 'other'):
                    terms.append(words[0])
                    words = words[1:]
                np = "NP_" + "_".join(self.lemmatizer.lemmatize(word) for word in words)
                terms.append(np)
            elif isinstance(chunk, Tree):
                # Subtrees with any other label are kept as plain words
                terms.extend(word for (word, tag) in chunk.leaves())
            else:
                terms.append(chunk[0])
        return ' '.join(terms)

    # Step 3: merge runs of adjacent NPs into a single NP
    def merge_NP(self, prepared_chunks):
        sentence = re.sub(r"NP_\w+(?: NP_\w+)+", lambda m: m.group(0).replace(" NP_", "_"), prepared_chunks)
        return sentence

    # Step 4: find the hypernym and hyponyms in the processed chunked results.
    # Given the prepared text, return the (hyponym, hypernym) pairs.
    def hyponym_extract(self, prepared_text, hearst_patterns):
        pairs = []
        for (pattern, parser) in hearst_patterns:
            matches = re.search(pattern, prepared_text)
            if matches:
                match_str = matches.group(0)
                # find all NP_xx and save to a list.
                nps = [i for i in match_str.split(" ") if i.startswith("NP_")]
                if parser == "first":
                    hypernym = nps[0]
                    hyponyms = nps[1:]
                else:
                    hypernym = nps[-1]
                    hyponyms = nps[:-1]
                for item in hyponyms:
                    pairs.append((item, hypernym))
        return pairs

    def clean_np(self, term):
        return term.replace("NP_", "").replace("_", " ")

    # Run the full pipeline using the methods of this class.
    def find_hyponyms(self, sentence):
        result = []
        pre_chunks = self.merge_NP(self.prepare_chunks(self.np_chunking(sentence)))
        pairs = self.hyponym_extract(pre_chunks, self.hearst_patterns)
        for (hypo, hype) in pairs:
            result.append((self.clean_np(hypo), self.clean_np(hype)))
        return result

In [12]:
# Test case for hearst patterns
hp = HearstPatterns()

test = ["Agar is a substance prepared from a mixture of red algae, such as Gelidium,for laboratory or industrial use.",
        "... works by such authors as Herrick, Goldsmith, and Shakespeare.",
        "... bistros, coffee shops, and other cheap eating places.",
        "...all common law countries, including Canada and England.",
        "...most European countries, especially France, England, and Spain."]
for txt in test:
    hps = hp.find_hyponyms(txt)
    print(hps)

[('Gelidium', 'red algae')]
[('Herrick', 'author'), ('Goldsmith', 'author'), ('Shakespeare', 'author')]
[('bistro', 'cheap eating place'), ('coffee shop', 'cheap eating place')]
[('Canada', 'common law country'), ('England', 'common law country')]
[('France', 'European country'), ('England', 'European country'), ('Spain', 'European country')]
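
The resulting pairs can be grouped by hypernym for downstream use. A small sketch using only the standard library, with two of the result lists above copied in as literals; ``group_by_hypernym`` is an illustrative helper and not part of ``HearstPatterns``.

```python
from collections import defaultdict

def group_by_hypernym(pairs):
    # Map each hypernym to the list of hyponyms found for it, in order of appearance.
    grouped = defaultdict(list)
    for hyponym, hypernym in pairs:
        grouped[hypernym].append(hyponym)
    return dict(grouped)

pairs = [
    ("Herrick", "author"), ("Goldsmith", "author"), ("Shakespeare", "author"),
    ("Canada", "common law country"), ("England", "common law country"),
]
print(group_by_hypernym(pairs))
```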
