# Rule-Based Relation Extraction

In [1]:
import os
import pandas as pd
import re
import spacy 
from spacy import displacy

# Load the en_core_web_lg model
nlp = spacy.load("en_core_web_lg")

## Dataset

In [2]:
texts =  [
    # https://www.bbc.co.uk/news/world-us-canada-60177979
    (
        "The US East Coast is hunkering down as a major blizzard hits the region for the first time in four years."
        "The storm is forecast to stretch from the Carolinas to Maine, packing hurricane-force winds in coastal parts."
        "Five states have declared emergencies."
        "Mayor Michelle Wu of Boston, a city that is no stranger to snowfall, said the storm could be 'historic'."
        "More than two feet of snow could fall in New England."
        "Weather officials also warn of flooding near the coast."
        "Over 5,000 US flights were cancelled between Friday and Sunday, according to FlightAware."
        "Forecasters say there is a chance the storm, known as a Nor\'easter, will blanket the Boston area with up to 2ft (61cm) of snow."
        ),
    # https://www.bbc.co.uk/news/business-60163814
    (  
        "Apple sales soared in the key Christmas shopping season, despite constraints due to a global shortage of microchips."
        "Sales at the iPhone giant rose 11% to a record $123.9bn (£92.6bn) in the October to December period, beating forecasts."
        "Shares jumped more than 4% in after-hours trade, as the report suggested the firm's pandemic boom is continuing."
        "Apple has seen purchases skyrocket during the pandemic as people spend more time online."
        "The firm's market value briefly hit the $3tn milestone in early January though its share price has slipped more recently amid weeks of market turmoil."
        ),
    # https://news.sky.com/story/staycation-frenzy-spurs-center-parcs-owner-to-prepare-4bn-sale-12527982
    (
        "Sky News has learnt that Brookfield Property Partners, the Canadian property giant, is paving the way to sell Center Parcs UK potentially as soon as this year."
        "City sources said this weekend that Brookfield had engaged the accountancy firm PriceWaterhouseCoopers to assist with preparations for a sale process."
        "Investment banks have yet to be formally appointed to handle an auction, and one person close to the process said it was possible that Brookfield would decide to retain the business for a longer period if it did not secure a sufficiently attractive offer."
        "Center Parcs is one of the most famous brands in the British leisure industry, drawing millions of visitors annually to its five UK sites and the latest addition to its portfolio, at Longford Forest in Ireland."
    )
]

##  Part-Of-Speech (POS) tags

In linguistics and grammar, a part of speech (POS) is a category of words that have similar grammatical properties.
For instance, "nouns" are words for real things such as people, places and objects. Words that describe nouns, such as tall, smart and large, are called "adjectives".

Natural Language Processing (NLP) applications use linguistic rules and machine learning models to predict the POS tag of each word from its position and context. Popular NLP packages such as NLTK and spaCy include this functionality out of the box. To read more about POS, see this [POS summary](https://towardsdatascience.com/part-of-speech-tagging-for-beginners-3a0754b2ebba), the spaCy [documentation](https://spacy.io/usage/linguistic-features#pos-tagging), this [SO explanation](https://stackoverflow.com/questions/40288323/what-do-spacys-part-of-speech-and-dependency-tags-mean), and this [POS tag reference list](https://sites.google.com/site/partofspeechhelp/#TOC-Welcome).

We can build a basic relation-extraction process by using grammar patterns / part of speech patterns to identify related nouns within a text. A simple rule might be:

```
Proper Noun - Verb - Proper Noun
```

Using spaCy, we can now iterate over each sentence and identify where this POS pattern occurs.

In [3]:
nouns = ['NNP','NN','NNS']
verbs = ["VBZ","VB","VBG"]
relations = list()

for text in texts:
    doc = nlp(text)
    for e,sent in enumerate(doc.sents):
        chain = list()
        for a in sent:
            if a.tag_ in nouns: # find first NOUN
                chain.append(a)
                # NOTE: a.i is a document-level index, whereas slicing sent[...] is relative to
                # the sentence start, so these slices only line up for a doc's first sentence
                for b in sent[a.i:]: # find ROOT, alternatively VERBS
                    if (b.dep_ == 'ROOT') and len(chain) == 1:
                        chain.append(b)
                        for c in sent[b.i:]: # find second NOUN
                            if c.tag_ in nouns and len(chain) == 2:
                                chain.append(c)

                                # print result and reset chain
                                relations.append(chain)
                                pos_chain = ' '.join([f"{i} ({i.tag_}|{i.dep_})" for i in sent[a.i:c.i+1]])
                                print(chain,'\n',pos_chain,'\n')
                                chain = list()

[US, hunkering, blizzard] 
 US (NNP|compound) East (NNP|compound) Coast (NNP|nsubj) is (VBZ|aux) hunkering (VBG|ROOT) down (RP|prt) as (IN|mark) a (DT|det) major (JJ|amod) blizzard (NN|nsubj) 

[East, hunkering, blizzard] 
 East (NNP|compound) Coast (NNP|nsubj) is (VBZ|aux) hunkering (VBG|ROOT) down (RP|prt) as (IN|mark) a (DT|det) major (JJ|amod) blizzard (NN|nsubj) 

[Coast, hunkering, blizzard] 
 Coast (NNP|nsubj) is (VBZ|aux) hunkering (VBG|ROOT) down (RP|prt) as (IN|mark) a (DT|det) major (JJ|amod) blizzard (NN|nsubj) 

[Apple, soared, Christmas] 
 Apple (NN|compound) sales (NNS|nsubj) soared (VBD|ROOT) in (IN|prep) the (DT|det) key (JJ|amod) Christmas (NNP|compound) 

[sales, soared, Christmas] 
 sales (NNS|nsubj) soared (VBD|ROOT) in (IN|prep) the (DT|det) key (JJ|amod) Christmas (NNP|compound) 

[Sky, learnt, Brookfield] 
 Sky (NNP|compound) News (NNP|nsubj) has (VBZ|aux) learnt (VBN|ROOT) that (IN|mark) Brookfield (NNP|compound) 

[News, learnt, Brookfield] 
 News (NNP|nsubj) 

This simple approach does a reasonable job of finding sentences that contain related entities:

- US --- **hunkering** ---> Blizzard
- Sales --- **soared** ---> Christmas
- Sky --- **learnt** ---> Brookfield
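
The three near-identical chains printed for the first sentence (US, East, Coast) arise because the loop restarts at every noun of a compound. A minimal sketch of one way to get a single triple per sentence, assuming we keep only the `nsubj` noun before the ROOT and the first noun after it. The `(text, tag, dep)` tuples are copied from the printed output above, and `first_triple` is a hypothetical helper name, not part of spaCy.

```python
# (text, tag, dep) tuples copied from the first printed chain above.
tagged = [
    ("US", "NNP", "compound"), ("East", "NNP", "compound"), ("Coast", "NNP", "nsubj"),
    ("is", "VBZ", "aux"), ("hunkering", "VBG", "ROOT"), ("down", "RP", "prt"),
    ("as", "IN", "mark"), ("a", "DT", "det"), ("major", "JJ", "amod"),
    ("blizzard", "NN", "nsubj"),
]

def first_triple(tokens):
    """Return (subject, root, first noun after root), or None if a part is missing."""
    root_idx = next((i for i, (_, _, dep) in enumerate(tokens) if dep == "ROOT"), None)
    if root_idx is None:
        return None
    # Subject: a noun-tagged token marked nsubj that precedes the ROOT.
    subject = next((text for text, tag, dep in tokens[:root_idx]
                    if dep == "nsubj" and tag.startswith("NN")), None)
    # Object candidate: the first noun-tagged token after the ROOT.
    obj = next((text for text, tag, _ in tokens[root_idx + 1:]
                if tag.startswith("NN")), None)
    if subject is None or obj is None:
        return None
    return (subject, tokens[root_idx][0], obj)

print(first_triple(tagged))
```

This yields one chain headed by `Coast` instead of three, but it still loses the compound "US East Coast", which is the limitation the next section addresses.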

## Relation Extraction

The problem with the above approach is that it relies on an extensive list of *Part-Of-Speech* tag patterns. This won't scale to most problems, because nouns and verbs come in a wide variety of forms and are often accompanied by modifiers. For instance, you generally want to capture compound phrases and patterns such as:
```
Metro-North worker = NNP-HYPH-NNP-NN
Killed by = VBN-IN
```
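
One way to cope with this variety, sketched under assumptions: match tags by prefix (`NN*`) rather than by an explicit list, and merge runs of adjacent nominal tokens into one phrase. The first token list is copied from the POS output of the first example sentence; the tags for "Metro-North worker" are assumed from the pattern above, and `is_nominal` and `merge_nominals` are hypothetical helper names.

```python
# (text, tag) pairs copied from the POS output of the first example sentence.
tagged = [
    ("US", "NNP"), ("East", "NNP"), ("Coast", "NNP"), ("is", "VBZ"),
    ("hunkering", "VBG"), ("down", "RP"), ("as", "IN"), ("a", "DT"),
    ("major", "JJ"), ("blizzard", "NN"),
]

def is_nominal(tag):
    # Prefix match covers NN, NNS, NNP and NNPS; HYPH keeps hyphenated names together.
    return tag.startswith("NN") or tag == "HYPH"

def merge_nominals(tokens):
    """Join each run of adjacent nominal tokens into a single phrase."""
    phrases, current = [], []
    for text, tag in tokens:
        if is_nominal(tag):
            current.append(text)
        elif current:
            phrases.append(" ".join(current).replace(" - ", "-"))
            current = []
    if current:
        phrases.append(" ".join(current).replace(" - ", "-"))
    return phrases

print(merge_nominals(tagged))

# Tags assumed for illustration, following the pattern shown above.
hyphenated = [("Metro", "NNP"), ("-", "HYPH"), ("North", "NNP"), ("worker", "NN")]
print(merge_nominals(hyphenated))
```

This is a crude stand-in for what a parser-backed noun chunker does; it ignores determiners and adjectives, so "a major blizzard" is reduced to "blizzard".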

To improve our method we can:
 1. Better capture and reflect things and objects, collectively named "entities".
 1. Develop POS patterns and rules to identify and extract relations between two or more entities.
 1. Train a probabilistic model to identify relation triplets, as done by systems such as Stanford and OLLIE (see the References).
 
### Capture entities

A spaCy pipeline with components for Named Entity Recognition (NER) and noun chunks is used to capture entities. Here, we could define some sensible rules, or limit the number and type of entities, to control what information is represented in our knowledge graph.

In [4]:
# Merge multi-token entities and noun chunks into single tokens.
# This is the spaCy v2 API; in spaCy v3 use nlp.add_pipe(name) instead.
for name in ("merge_entities", "merge_noun_chunks"):
    if name not in nlp.pipe_names:
        nlp.add_pipe(nlp.create_pipe(name))
print(nlp.pipe_names)

doc = nlp(texts[2])
print('\n', ' '.join([f"{d} ({d.tag_}|{d.dep_})" for d in doc]), '\n')
displacy.render(doc, style='ent')

print('Entities:')
for t in doc:
    if t.ent_type_ != '':
        print('\t', t, t.ent_type_)

print('Noun chunks:')
for chunk in doc.noun_chunks:
    print('\t', chunk.text)

['tagger', 'parser', 'ner', 'merge_entities', 'merge_noun_chunks']

 Sky News (NNP|nsubj) has (VBZ|aux) learnt (VBN|ROOT) that (IN|mark) Brookfield Property Partners (NNPS|nsubj) , (,|punct) the Canadian property giant (NN|appos) , (,|punct) is (VBZ|aux) paving (VBG|ccomp) the way (NN|dobj) to (TO|aux) sell (VB|relcl) Center Parcs UK (NNP|dobj) potentially (RB|advmod) as (RB|advmod) soon (RB|advmod) as (IN|prep) this year (NN|pobj) . (.|punct) City sources (NNS|nsubj) said (VBD|ROOT) this weekend (NN|npadvmod) that (IN|mark) Brookfield (NNP|nsubj) had (VBD|aux) engaged (VBN|ccomp) the accountancy firm PriceWaterhouseCoopers (NNS|dobj) to (TO|aux) assist (VB|xcomp) with (IN|prep) preparations (NNS|pobj) for (IN|prep) a sale process (NN|pobj) . (.|punct) Investment banks (NNS|nsubj) have (VBP|ROOT) yet (RB|advmod) to (TO|aux) be (VB|auxpass) formally (RB|advmod) appointed (VBN|xcomp) to (TO|aux) handle (VB|xcomp) an auction (NN|dobj) , (,|punct) and (CC|cc) one person (NN|nsubj) close (J

Entities:
	 Sky News ORG
	 Brookfield Property Partners ORG
	 Center Parcs UK ORG
	 this year DATE
	 this weekend DATE
	 Brookfield GPE
	 one person CARDINAL
	 Brookfield GPE
	 Center Parcs ORG
	 one CARDINAL
	 millions CARDINAL
	 annually DATE
	 Longford Forest ORG
	 Ireland GPE
Noun chunks:
	 Sky News
	 Brookfield Property Partners
	 the Canadian property giant
	 the way
	 Center Parcs UK
	 this year
	 City sources
	 Brookfield
	 the accountancy firm PriceWaterhouseCoopers
	 preparations
	 a sale process
	 Investment banks
	 an auction
	 one person
	 the process
	 it
	 Brookfield
	 the business
	 a longer period
	 it
	 a sufficiently attractive offer
	 Center Parcs
	 the most famous brands
	 the British leisure industry
	 millions
	 visitors
	 its five UK sites
	 its portfolio
	 Longford Forest
	 Ireland


### Capture relations

For each sentence, relations can be extracted by iterating over adjacent pairs of entities and noun chunks and yielding the verb-tagged tokens (including the ROOT) that lie between them. Because spaCy's dependency tree is built per sentence, this approach cannot extract relations that span sentences.

In [195]:
for sent in doc.sents:
    s_term_pos = " ".join([f"{t} ({t.tag_}|{t.dep_})" for t in sent])

    # Key each span by its (start, end) token offsets so that sorting gives document order;
    # named entities overwrite noun chunks that cover the same span.
    entities = {(i.start, i.end): {"span": i, "type": "NOUN_CHUNK"} for i in sent.noun_chunks}
    entities.update({(i.start, i.end): {"span": i, "type": i.label_} for i in sent.ents})
    keys = sorted(entities)

    if len(keys) > 1:
        pairs = list(zip(keys, keys[1:]))
        print('\n', s_term_pos)
        for p in pairs:
            start, end = entities[p[0]], entities[p[1]]
            for w in doc[start['span'].start:end['span'].end]:
                if w.tag_ in ['VBZ','VBN','VBG','VBD','VB']:
                    print("\t>>>\t", f"{start['span']} - {w}({w.tag_}|{w.dep_}) - {end['span']}")
                    # sides() is a helper that is not defined in this section
                    print("\t>>>\t", f"{start['span']} - {sides(w)} - {end['span']}")


 Sky News (NNP|nsubj) has (VBZ|aux) learnt (VBN|ROOT) that (IN|mark) Brookfield Property Partners (NNPS|nsubj) , (,|punct) the Canadian property giant (NN|appos) , (,|punct) is (VBZ|aux) paving (VBG|ccomp) the way (NN|dobj) to (TO|aux) sell (VB|relcl) Center Parcs UK (NNP|dobj) potentially (RB|advmod) as (RB|advmod) soon (RB|advmod) as (IN|prep) this year (NN|pobj) . (.|punct)
	>>>	 Sky News - has(VBZ|aux) - Brookfield Property Partners
	>>>	 Sky News - has - Brookfield Property Partners
	>>>	 Sky News - learnt(VBN|ROOT) - Brookfield Property Partners
	>>>	 Sky News - Sky News has learnt paving . - Brookfield Property Partners
	>>>	 the Canadian property giant - is(VBZ|aux) - the way
	>>>	 the Canadian property giant - is - the way
	>>>	 the Canadian property giant - paving(VBG|ccomp) - the way
	>>>	 the Canadian property giant - that Brookfield Property Partners is paving the way - the way
	>>>	 the way - sell(VB|relcl) - Center Parcs UK
	>>>	 the way - to sell Center Parcs UK potent

Better, but not great. There are a few improvements we could make to how the relation phrase is built.

In [198]:
def relation(start, end):
    """Return (start span, verb phrase, end span) triples for verbs between two entities."""
    doc = start['span'].doc
    relations = list()
    for word in doc[start['span'].start:end['span'].end]:
        if word.tag_ in ['VBZ','VBN','VBG','VBD','VB']:
            # keep only the verb's children that lie between the two entities
            left = [i for i in word.lefts if i.i > start['span'].end]
            right = [i for i in word.rights if i.i < end['span'].start]

            verb = ' '.join([l.text for l in left] + [word.text] + [r.text for r in right])
            relations.append((start['span'], verb, end['span']))
    if len(relations) > 0:
        return relations

In [199]:
for sent in list(doc.sents)[:1]:
    s_term_pos = " ".join([f"{t} ({t.tag_}|{t.dep_})" for t in sent])

    # Key each span by its (start, end) token offsets so that sorting gives document order
    entities = {(i.start, i.end): {"span": i, "type": "NOUN_CHUNK"} for i in sent.noun_chunks}
    entities.update({(i.start, i.end): {"span": i, "type": i.label_} for i in sent.ents})
    keys = sorted(entities)

    if len(keys) > 1:
        print(sent)
        pairs = list(zip(keys, keys[1:]))
        for p in pairs:
            start, end = entities[p[0]], entities[p[1]]
            rel = relation(start, end)
            if rel is not None:
                for r in rel:
                    print('\t', r)

Sky News has learnt that Brookfield Property Partners, the Canadian property giant, is paving the way to sell Center Parcs UK potentially as soon as this year.
	 (Sky News, 'has', Brookfield Property Partners)
	 (Sky News, 'learnt', Brookfield Property Partners)
	 (the Canadian property giant, 'is', the way)
	 (the Canadian property giant, 'is paving', the way)
	 (the way, 'sell', Center Parcs UK)


In [209]:
[(t,t.dep_,t.tag_) for t in sent]

[(Sky News, 'nsubj', 'NNP'),
 (has, 'aux', 'VBZ'),
 (learnt, 'ROOT', 'VBN'),
 (that, 'mark', 'IN'),
 (Brookfield Property Partners, 'nsubj', 'NNPS'),
 (,, 'punct', ','),
 (the Canadian property giant, 'appos', 'NN'),
 (,, 'punct', ','),
 (is, 'aux', 'VBZ'),
 (paving, 'ccomp', 'VBG'),
 (the way, 'dobj', 'NN'),
 (to, 'aux', 'TO'),
 (sell, 'relcl', 'VB'),
 (Center Parcs UK, 'dobj', 'NNP'),
 (potentially, 'advmod', 'RB'),
 (as, 'advmod', 'RB'),
 (soon, 'advmod', 'RB'),
 (as, 'prep', 'IN'),
 (this year, 'pobj', 'NN'),
 (., 'punct', '.')]
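
The listing above suggests one cheap refinement: auxiliaries such as "has" and "is" carry the `aux` dependency, so they can be skipped when choosing a relation label. A minimal sketch, assuming positions in a plain list stand in for token indices; the tuples are copied from the listing above and `content_verbs` is a hypothetical helper name.

```python
# (text, dep, tag) tuples copied from the token listing above.
sentence = [
    ("Sky News", "nsubj", "NNP"), ("has", "aux", "VBZ"), ("learnt", "ROOT", "VBN"),
    ("that", "mark", "IN"), ("Brookfield Property Partners", "nsubj", "NNPS"),
    (",", "punct", ","), ("the Canadian property giant", "appos", "NN"),
    (",", "punct", ","), ("is", "aux", "VBZ"), ("paving", "ccomp", "VBG"),
    ("the way", "dobj", "NN"), ("to", "aux", "TO"), ("sell", "relcl", "VB"),
    ("Center Parcs UK", "dobj", "NNP"),
]

def content_verbs(tokens, start, end):
    """Verb-tagged tokens strictly between positions start and end, skipping auxiliaries."""
    return [text for text, dep, tag in tokens[start + 1:end]
            if tag.startswith("VB") and dep not in ("aux", "auxpass")]

print(content_verbs(sentence, 0, 4))    # Sky News ... Brookfield Property Partners
print(content_verbs(sentence, 6, 10))   # the Canadian property giant ... the way
print(content_verbs(sentence, 10, 13))  # the way ... Center Parcs UK
```

Each pair is now left with a single candidate verb, although the second pair still attaches "paving" to the appositive rather than to "Brookfield Property Partners".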

In [212]:
spacy.displacy.render(sent, style='dep')


Using named entities appears to capture our people and organisations better. However, naively creating triplets by extracting the verbs between entities does not work well, because:
 - It fails on complex sentence structures.
 - It ignores other objects represented by nouns, proper nouns, common nouns etc.
 - Not all entity types are equally relevant, e.g. PERSON versus ORDINAL.
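
The last point can be sketched as a simple label filter. The `(text, label)` pairs are copied from the "Entities:" output earlier in the notebook; the set of labels to keep is an assumption chosen for illustration, and `filter_entities` is a hypothetical helper name.

```python
# (text, label) pairs copied from the "Entities:" output earlier in the notebook.
entities = [
    ("Sky News", "ORG"), ("Brookfield Property Partners", "ORG"),
    ("Center Parcs UK", "ORG"), ("this year", "DATE"), ("this weekend", "DATE"),
    ("Brookfield", "GPE"), ("one person", "CARDINAL"), ("Brookfield", "GPE"),
    ("Center Parcs", "ORG"), ("one", "CARDINAL"), ("millions", "CARDINAL"),
    ("annually", "DATE"), ("Longford Forest", "ORG"), ("Ireland", "GPE"),
]

# Assumption: only these entity types are wanted as knowledge-graph nodes.
KEEP_LABELS = {"PERSON", "ORG", "GPE"}

def filter_entities(ents, keep=KEEP_LABELS):
    """Keep entities whose label is in `keep`, dropping exact duplicates in order."""
    seen, kept = set(), []
    for text, label in ents:
        if label in keep and (text, label) not in seen:
            seen.add((text, label))
            kept.append((text, label))
    return kept

for text, label in filter_entities(entities):
    print(label, text)
```

A label filter removes the DATE and CARDINAL noise, but it cannot repair tagging errors: "Brookfield" is labelled GPE in the output above and would still enter the graph as a place.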

We could improve some of this by incorporating **[Noun Chunks](https://spacy.io/usage/linguistic-features#noun-chunks)**. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”.

    Text: The original noun chunk text.
    Root text: The original text of the word connecting the noun chunk to the rest of the parse.
    Root dep: Dependency relation connecting the root to its head.
    Root head text: The text of the root token’s head.
    Children: The immediate syntactic dependents of the root token.
    
 - spaCy uses the terms **head** and **child** to describe the words connected by a single arc in the dependency tree. 
 - The term **dep** is used for the arc label, which describes the type of syntactic relation that connects the child to the head.
 
We can extract further relations by examining the noun modifiers in the noun chunks.  
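
As one hedged illustration of a modifier-based relation, an appositional modifier (`appos`) links a description to the noun it renames. In the sketch below the text and dep values are copied from the token listing earlier in the notebook, while the head indices are hand-written assumptions rather than parser output (with spaCy they would come from `token.head`); `appositive_relations` is a hypothetical helper name.

```python
# (text, dep, head index) triples. Text and dep are copied from the token listing
# earlier in the notebook; the head indices are assumptions made for illustration.
tokens = [
    ("Sky News", "nsubj", 2),
    ("has", "aux", 2),
    ("learnt", "ROOT", 2),
    ("that", "mark", 9),
    ("Brookfield Property Partners", "nsubj", 9),
    (",", "punct", 4),
    ("the Canadian property giant", "appos", 4),
    (",", "punct", 4),
    ("is", "aux", 9),
    ("paving", "ccomp", 2),
]

def appositive_relations(toks):
    """Return (head text, 'appos', modifier text) for every appositional modifier."""
    return [(toks[head][0], "appos", text)
            for text, dep, head in toks if dep == "appos"]

print(appositive_relations(tokens))
```

Under these assumed heads, the sentence gives a descriptive triple linking "Brookfield Property Partners" to "the Canadian property giant" without needing any verb between them.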

Some other factors to consider:

 - Ownership, e.g. a noun or named entity followed by [NNS/VBZ](https://sites.google.com/site/partofspeechhelp/home/nns_vbz)
 - [KG and pruning](http://philipperemy.github.io/information-extract/)
   - [git](https://github.com/philipperemy/information-extraction-with-dominating-rules)
 
## References

 - [OLLIE](https://www.reddit.com/r/LanguageTechnology/comments/bovsf5/we_release_opiec_the_largest_open_information/)
 - [ClausIE](https://github.com/mmxgn/clausiepy)
 - [MinIE](https://github.com/mmxgn/miniepy/graphs/contributors)

In [None]:
print(findSVAOs(doc))