# Rule-Based Relation Extraction

In [1]:
import os
import pandas as pd
import re
import spacy 
from spacy import displacy

# Load the en_core_web_lg model
nlp = spacy.load("en_core_web_lg")

## Dataset

In [2]:
path = "../data/uci-news-aggregator"
files = os.listdir(path)
df = pd.read_csv(os.path.join(path,files[1]),nrows=500)


# Apply spacy to the article titles
df["spacy_title"] = df["TITLE"].apply(lambda x : nlp(x))

# Add a column holding the named entities found in each title
df["named_entities"] = df["spacy_title"].apply(lambda x : x.ents)

df[['spacy_title','named_entities']][:3]

Unnamed: 0,spacy_title,named_entities
0,"(Fed, official, says, weak, data, caused, by, ...","((Fed),)"
1,"(Fed, 's, Charles, Plosser, sees, high, bar, f...","((Fed), (Charles, Plosser))"
2,"(US, open, :, Stocks, fall, after, Fed, offici...","((US), (Fed))"


##  Part-Of-Speech (POS) tags

In linguistics and grammar, a part of speech (POS) is a category of words that share similar grammatical properties. 
For instance, "nouns" are words for real things such as people, places and objects. Words that describe nouns, such as tall, smart and large, are called "adjectives". 

Applications in Natural Language Processing (NLP) apply linguistic rules and machine learning models to predict which POS tag applies to each word by evaluating its position and context. Popular NLP packages such as NLTK and spaCy include this functionality out of the box. To read more about POS tags, see this [POS summary](https://towardsdatascience.com/part-of-speech-tagging-for-beginners-3a0754b2ebba), the spaCy [documentation](https://spacy.io/usage/linguistic-features#pos-tagging), this [Stack Overflow explanation](https://stackoverflow.com/questions/40288323/what-do-spacys-part-of-speech-and-dependency-tags-mean), and this [POS tag reference list](https://sites.google.com/site/partofspeechhelp/#TOC-Welcome).

We can build a basic relation-extraction process by using grammar patterns (part-of-speech patterns) to identify related nouns within a text. A simple rule might be:

```
Proper Noun - Verb - Proper Noun
```

Using spaCy, we can now iterate over each title and identify where this POS pattern occurs.

In [42]:
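# Penn Treebank tags: proper noun, 3rd-person singular present verb, proper noun
pos_pattern = "NNP-VBZ-NNP"

counter = 0
limit = 8  # stop after printing this many matching titles
index_list = list()

# The sample is random, so the titles printed differ between runs
for i, r in df.sample(frac=0.7).iterrows():
    # Join the fine-grained tags of the title into one string, e.g. "NNP-VBZ-NNP-NN"
    pos_chain = "-".join([d.tag_ for d in r['spacy_title']])
    # Keep titles that contain the pattern and mention at least two named entities
    if pos_pattern in pos_chain and len(r["named_entities"]) >= 2:
        index_list.append(i)
        print(f"{i}\t{r.TITLE}\n\t{pos_chain}\n")

        counter += 1
        if counter == limit:
            break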

490	EBay rejects Icahn slate of directors
	NNP-VBZ-NNP-NN-IN-NNS

55	Icahn Targets Ebay Chief Donahoe After Company Rejects Board Slate
	NNP-NNP-NNP-NNP-NNP-IN-NNP-VBZ-NNP-NNP

66	eBay's John Donahoe talks Icahn, conflicts, and $100 stock price (someday)
	NNP-POS-NNP-NNP-VBZ-NNP-,-NNS-,-CC-$-CD-NN-NN--LRB--RB--RRB-

20	Noyer Says Strong Euro Creates Unwarranted Economic Pressure (1)
	NNP-VBZ-NNP-NNP-VBZ-JJ-NNP-NN--LRB--CD--RRB-

68	Carl Icahn slams eBay CEO
	NNP-NNP-VBZ-NNP-NNP

320	Japan says Bitcoin's not currency, but taxable
	NNP-VBZ-NNP-POS-RB-NN-,-CC-JJ

7	Fed's Plosser expects US unemployment to fall to 6.2% by the end of 2014
	NNP-POS-NNP-VBZ-NNP-NN-TO-VB-IN-CD-NN-IN-DT-NN-IN-CD

368	Weil on Finance: Bill Ackman Keeps Hope Alive
	NNP-IN-NNP-:-NNP-NNP-VBZ-NNP-NNP



In [6]:
## examples to pull out patterns

# doc = nlp(df.loc[7,'TITLE'])
# pos_chain = "-".join([d.tag_ for d in doc])
# doc,pos_chain,pos_pattern 

# import re
  
# # Find start end
# def substr_index(string,pattern):
#     a = [(m.start(),m.end()) for m in re.finditer('{0}'.format(pattern), string)]
#     return a

# substr_index(pos_chain,pos_pattern)
# [(t.i,t,t.tag_ ) for t in doc]

So, this simple approach seems OK at finding titles with related entities and their relationships. Examples collected over several runs (the sample is random, so the titles printed differ each time):

- Honda --- **restructures** ---> US operations  
- Carl Icahn --- **slams** ---> eBay CEO
- Google --- **confirms** ---> Android SDK 
- GM --- **hires** ---> Lehman Brothers 
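
The substring test on the joined tag chain tells us that a title contains the pattern, but not which words matched, and it also accepts chains such as `NNP-VBZ-NNPS` because `NNP` is a prefix of `NNPS`. A minimal sketch of a stricter check follows, assuming the tokens are available as `(text, tag)` pairs (hard-coded here rather than read from spaCy). The helper name `find_tag_pattern` is hypothetical.

```python
def find_tag_pattern(tagged, pattern):
    """Return the words of every token window whose tags equal `pattern` exactly."""
    tags = [tag for _, tag in tagged]
    size = len(pattern)
    hits = []
    for start in range(len(tags) - size + 1):
        # Compare whole tags, so "NNPS" is never mistaken for "NNP"
        if tags[start:start + size] == pattern:
            hits.append([text for text, _ in tagged[start:start + size]])
    return hits


# Hand-copied from title 68 in the output above (tags as printed there)
title = [("Carl", "NNP"), ("Icahn", "NNP"), ("slams", "VBZ"),
         ("eBay", "NNP"), ("CEO", "NNP")]

print(find_tag_pattern(title, ["NNP", "VBZ", "NNP"]))
```

Because only three tags are matched, the window covers "Icahn slams eBay" and drops "Carl" and "CEO", which is one reason to merge multi-word names before matching.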

## POS patterns

The problem with the above approach is that it relies on an extensive list of *part-of-speech* tag patterns. This won't scale for most problems, because nouns and verbs come in a wide variety of forms and often carry modifiers. For instance, you generally want to capture compound phrases and patterns such as:

```
Metro-North worker = NNP-HYPH-NNP-NN
killed by = VBN-IN
```
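
One way to absorb some of this variation without listing every pattern is a regular expression over the tag sequence. The sketch below is only an illustration under stated assumptions: tokens arrive as hand-written `(text, tag)` pairs, tags are joined with spaces so that tags containing hyphens stay unambiguous, and the names `NOUN`, `VERB`, `TRIPLE` and `extract_triple` are hypothetical.

```python
import re

# One or more proper nouns, optionally joined by a hyphen token
NOUN = r"NNP(?: HYPH NNP| NNP)*"
# Any verb tag (VB, VBD, VBG, VBN, VBP, VBZ), optionally followed by a preposition
VERB = r"VB[DGNPZ]?(?: IN)?"
# (?<!\S) and (?!\S) anchor the match on whole tags
TRIPLE = re.compile(rf"(?<!\S)({NOUN}) ({VERB}) ({NOUN})(?!\S)")


def extract_triple(tagged):
    """Return (subject, relation, object) for the first match, or None."""
    words = [text for text, _ in tagged]
    tags = " ".join(tag for _, tag in tagged)
    match = TRIPLE.search(tags)
    if match is None:
        return None
    parts = []
    for group in (1, 2, 3):
        # Spaces before the group give the index of its first token
        first = tags[:match.start(group)].count(" ")
        length = match.group(group).count(" ") + 1
        parts.append(" ".join(words[first:first + length]))
    return tuple(parts)


# Tags for this title are copied from the output above
icahn = [("Carl", "NNP"), ("Icahn", "NNP"), ("slams", "VBZ"),
         ("eBay", "NNP"), ("CEO", "NNP")]
print(extract_triple(icahn))
```

Words are re-joined with single spaces, so a hyphenated name comes back as `Metro - North`. The expression still grows quickly as more phrase types are added, which is the scaling problem described above.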

To improve our method we can:
 1. Use Named Entity Recognition and noun chunks to better capture things and objects.
     - We can also define sensible rules that limit the entity types between which relations are sought.
 
 1. Constrain the type and number of relations we wish to find, and create patterns for only those. 

 1. Train a probabilistic model to identify relation triplets (e.g. Stanford, OLLIE; see the reddit link under References).
 
Below we will try to form relations between named entities using approach 2.

In [58]:
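# Merge multi-token entities and noun chunks into single tokens.
# This is the spaCy v2 API; in spaCy v3 the equivalent call is nlp.add_pipe("merge_entities").
try:
    nlp.add_pipe(nlp.create_pipe("merge_entities"))
    nlp.add_pipe(nlp.create_pipe("merge_noun_chunks"))
except ValueError:
    # Raised when the components are already in the pipeline (cell re-run)
    pass
print(nlp.pipe_names)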

['tagger', 'parser', 'ner', 'merge_entities', 'merge_noun_chunks']


In [274]:
text = "Ben Postance lives in Birmingham. Ben is about 30 years old. Fed's Plosser expects United States unemployment to fall to 6.2% by the end of 2014."
doc = nlp(text)
print(' '.join([f"{d} ({d.tag_}|{d.dep_})" for d in doc]),'\n')
spacy.displacy.render(doc, style='ent')

Ben Postance (NNP|nsubj) lives (VBZ|ROOT) in (IN|prep) Birmingham (NNP|pobj) . (.|punct) Ben (NNP|nsubj) is (VBZ|ROOT) about 30 years old (JJ|acomp) . (.|punct) Fed's Plosser (NNP|nsubj) expects (VBZ|ROOT) United States (NNP|nsubj) unemployment (NN|nsubj) to (TO|aux) fall (VB|ccomp) to (IN|prep) 6.2% (NN|pobj) by (IN|prep) the end of 2014 (NN|pobj) . (.|punct) 



In [275]:
print('Entities:')
for t in doc:
    if t.ent_type_ != '': print('\t',t,t.ent_type_)

print('Noun chunks:')
for chunk in doc.noun_chunks:
    print('\t',chunk.text, )
    #chunk.root.text, chunk.root.dep_,chunk.root.head.text)

Entities:
	 Ben Postance PERSON
	 Birmingham GPE
	 Ben PERSON
	 about 30 years old DATE
	 Fed's Plosser ORG
	 United States GPE
	 6.2% PERCENT
	 the end of 2014 DATE
Noun chunks:
	 Ben Postance
	 Birmingham
	 Ben
	 Fed's Plosser
	 United States
	 6.2%
	 the end of 2014


In [276]:
for s in doc.sents: 
    ents = s.ents
    if len(ents) > 1:
        # Pair each entity with the next entity in the same sentence
        pairs = list(zip(ents, ents[1:]))
        for p in pairs:
            # Report every verb located between the two entities
            for w in doc[p[0].start:p[1].end]:
                if w.tag_ in ['VBZ','VBN','VBG','VBD','VB']:
                    print(f"{s}\t>>>\t",f"{p[0]} - {w}({w.tag_}) - {p[1]}")

Ben Postance lives in Birmingham.	>>>	 Ben Postance - lives(VBZ) - Birmingham
Ben is about 30 years old.	>>>	 Ben - is(VBZ) - about 30 years old
Fed's Plosser expects United States unemployment to fall to 6.2% by the end of 2014.	>>>	 Fed's Plosser - expects(VBZ) - United States
Fed's Plosser expects United States unemployment to fall to 6.2% by the end of 2014.	>>>	 United States - fall(VB) - 6.2%


In [278]:
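def get_relation(span):
    """Return the first verb in `span`, joined to an adjacent 'to' or preposition if present."""
    tokens = list(span)
    for n, t in enumerate(tokens):
        # Look neighbours up by position within the span: t.i is an index into
        # the whole Doc and points at the wrong token for later sentences.
        pre_token = tokens[n - 1] if n > 0 else t
        next_token = tokens[n + 1] if n + 1 < len(tokens) else t

        if t.tag_ in ['VBZ','VBN','VBG','VBD','VB']:
            # TO = infinitival "to", IN = preposition or subordinating conjunction
            if pre_token.tag_ in ('TO', 'IN'):
                return pre_token.text + " " + t.text
            if next_token.tag_ in ('TO', 'IN'):
                return t.text + " " + next_token.text
            return t.text

    # No verb between the two entities
    return None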

In [279]:
for s in doc.sents: 
    ents = s.ents
    if len(ents) > 1:
        pairs = [(x,y) for x,y in zip(ents,ents[1:])]
        for p in pairs:
            w = get_relation(doc[p[0].start:p[1].end])
            if w is not None: 
                print(f"{s}\t>>>\t",f"{p[0]} - {w} - {p[1]}")

Ben Postance lives in Birmingham.	>>>	 Ben Postance - lives in - Birmingham
Ben is about 30 years old.	>>>	 Ben - is - about 30 years old
Fed's Plosser expects United States unemployment to fall to 6.2% by the end of 2014.	>>>	 Fed's Plosser - expects - United States
Fed's Plosser expects United States unemployment to fall to 6.2% by the end of 2014.	>>>	 United States - to fall - 6.2%


In [205]:

from spacy.matcher import Matcher

In [198]:
s

Fed's Plosser expects United States unemployment to fall to 6.2% by the end of 2014.

In [202]:
get_entities("Fed's Plosser expects United States unemployment to fall to 6.2% by the end of 2014.")

['unemployment', 'the end of 2014']

In [206]:
get_relation("Fed's Plosser expects United States unemployment to fall to 6.2% by the end of 2014.")

'expects'

In [203]:
def get_entities(sent):
    ## chunk 1
    ent1 = ""
    ent2 = ""

    prv_tok_dep = ""    # dependency tag of previous token in the sentence
    prv_tok_text = ""   # previous token in the sentence

    prefix = ""
    modifier = ""

    #############################################################

    for tok in nlp(sent):
        ## chunk 2
        # if token is a punctuation mark then move on to the next token
        if tok.dep_ != "punct":
            # check: token is a compound word or not
            if tok.dep_ == "compound":
                prefix = tok.text
            # if the previous word was also a 'compound' then add the current word to it
            if prv_tok_dep == "compound":
                prefix = prv_tok_text + " "+ tok.text

        # check: token is a modifier or not
        if tok.dep_.endswith("mod"):
            modifier = tok.text
            # if the previous word was a 'compound' then prepend it to the modifier
            if prv_tok_dep == "compound":
                modifier = prv_tok_text + " "+ tok.text

        ## chunk 3
        # subject found: build the first entity and reset the state
        # (.find("subj") == True only held when the match sat at index 1)
        if "subj" in tok.dep_:
            ent1 = modifier +" "+ prefix + " "+ tok.text
            prefix = ""
            modifier = ""
            prv_tok_dep = ""
            prv_tok_text = ""      

        ## chunk 4
        # object found: build the second entity
        if "obj" in tok.dep_:
            ent2 = modifier +" "+ prefix +" "+ tok.text

        ## chunk 5  
        # update variables
        prv_tok_dep = tok.dep_
        prv_tok_text = tok.text
        #############################################################

    return [ent1.strip(), ent2.strip()]

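def get_relation(sent):
    doc = nlp(sent)

    # Matcher class object 
    matcher = Matcher(nlp.vocab)

    # Pattern: the ROOT of the sentence, optionally followed by a preposition,
    # an agent (e.g. "by") and an adjective
    pattern = [{'DEP':'ROOT'}, 
            {'DEP':'prep','OP':"?"},
            {'DEP':'agent','OP':"?"},  
            {'POS':'ADJ','OP':"?"}] 

    # spaCy v2 signature; in spaCy v3 this is matcher.add("matching_1", [pattern])
    matcher.add("matching_1", None, pattern) 

    matches = matcher(doc)
    if not matches:
        return None

    # Use the last match returned by the matcher
    match_id, start, end = matches[-1]
    return doc[start:end].text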

OK, so using named entities appears better able to capture our people and organisations. However, naively creating triplets by extracting the verbs between entities does not work well because:
 - It fails on complex sentence structures. 
 - It ignores other objects represented by nouns (proper nouns and common nouns) that are not tagged as entities. 
 - Not all entity type pairs are relevant, e.g. PERSON:ORDINAL.
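
The last point can be handled with a whitelist of entity-type pairs. This is a minimal sketch, assuming entities are given as `(text, label)` tuples in sentence order; the contents of `ALLOWED_PAIRS` and the helper name `candidate_pairs` are illustrative assumptions, not a recommended set.

```python
# Assumed whitelist of (first label, second label) pairs worth relating
ALLOWED_PAIRS = {
    ("PERSON", "GPE"),
    ("PERSON", "ORG"),
    ("ORG", "ORG"),
    ("ORG", "GPE"),
}


def candidate_pairs(entities, allowed=ALLOWED_PAIRS):
    """Pair each entity with the next one and keep only whitelisted label pairs."""
    return [(first, second)
            for first, second in zip(entities, entities[1:])
            if (first[1], second[1]) in allowed]


# Entities of the third example sentence, labels copied from the output above
ents = [("Fed's Plosser", "ORG"), ("United States", "GPE"),
        ("6.2%", "PERCENT"), ("the end of 2014", "DATE")]

for first, second in candidate_pairs(ents):
    print(first[0], "->", second[0])
```

With a spaCy `Doc`, the tuples could be built per sentence as `[(e.text, e.label_) for e in sent.ents]`.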

We could improve some of this by incorporating **[Noun Chunks](https://spacy.io/usage/linguistic-features#noun-chunks)**. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”. Each chunk exposes the following attributes:

    Text: The original noun chunk text.
    Root text: The original text of the word connecting the noun chunk to the rest of the parse.
    Root dep: Dependency relation connecting the root to its head.
    Root head text: The text of the root token’s head.
    Children: The immediate syntactic dependents of the root token.
    
 - spaCy uses the terms **head** and **child** to describe the words connected by a single arc in the dependency tree. 
 - The term **dep** is used for the arc label, which describes the type of syntactic relation that connects the child to the head.
 
We can extract further relations by examining the noun modifiers in the noun chunks.  
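
As a rough sketch of that idea, the snippet below turns the modifiers inside one noun chunk into simple triples. It assumes the chunk has been copied by hand into `(text, dep)` tuples rather than read from a parser; the dependency labels shown, the `MODIFIER_DEPS` set and the `has_modifier` relation name are all illustrative assumptions.

```python
# Dependency labels treated as noun modifiers (an assumed, non-exhaustive set)
MODIFIER_DEPS = {"amod", "compound", "nmod", "poss"}


def modifier_relations(chunk, root_text):
    """Return (root, "has_modifier", modifier) for each modifying token in the chunk."""
    return [(root_text, "has_modifier", text)
            for text, dep in chunk
            if dep in MODIFIER_DEPS and text != root_text]


# Hand-annotated chunk "the lavish green grass"; labels are assumptions, not parser output
chunk = [("the", "det"), ("lavish", "amod"), ("green", "amod"), ("grass", "nsubj")]

for triple in modifier_relations(chunk, "grass"):
    print(triple)
```

With a parsed `Doc`, the inputs could come from `chunk.root.text` and `[(t.text, t.dep_) for t in chunk]` for each `chunk` in `doc.noun_chunks`. That needs a pipeline without `merge_noun_chunks`, since merged chunks no longer expose their inner tokens.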

Some other factors to consider:

 - Ownership: E.g. Noun or Named Entity followed by : [NNS/VBZ](https://sites.google.com/site/partofspeechhelp/home/nns_vbz)
 - [KG and pruning](http://philipperemy.github.io/information-extract/)
  - [git](https://github.com/philipperemy/information-extraction-with-dominating-rules)
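
For the ownership case, one simple signal is the possessive marker, which carries the tag `POS` in the tag chains printed earlier (for example in "Fed's Plosser" and "eBay's John Donahoe"). The sketch below is one possible reading of that rule, assuming hand-written `(text, tag)` pairs and that only runs of proper nouns on either side of the marker are of interest; the helper name `possessive_pairs` is hypothetical.

```python
def possessive_pairs(tagged):
    """Return (owner, owned) pairs around each possessive marker (tag POS)."""
    pairs = []
    for i, (_, tag) in enumerate(tagged):
        if tag != "POS":
            continue
        # Extend left over the run of proper nouns that forms the owner
        left = i
        while left > 0 and tagged[left - 1][1] == "NNP":
            left -= 1
        # Extend right over the run of proper nouns that forms the owned item
        right = i + 1
        while right < len(tagged) and tagged[right][1] == "NNP":
            right += 1
        owner = " ".join(text for text, _ in tagged[left:i])
        owned = " ".join(text for text, _ in tagged[i + 1:right])
        if owner and owned:
            pairs.append((owner, owned))
    return pairs


# First tokens of title 66, tags copied from the output above
title = [("eBay", "NNP"), ("'s", "POS"), ("John", "NNP"), ("Donahoe", "NNP"),
         ("talks", "VBZ"), ("Icahn", "NNP")]

print(possessive_pairs(title))
```

A possessive does not always mean ownership in the strict sense (here it reads more like "works for"), so the extracted pairs would still need a relation label chosen by a further rule.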
 
## References

 - [OLLIE](https://www.reddit.com/r/LanguageTechnology/comments/bovsf5/we_release_opiec_the_largest_open_information/)
 - [ClausIE](https://github.com/mmxgn/clausiepy)
 - [MinIE](https://github.com/mmxgn/miniepy/graphs/contributors)