# Rule-based methods


In [1]:
import os
import pandas as pd
import spacy 
from spacy import displacy

# Load the en_core_web_lg model
nlp = spacy.load("en_core_web_lg")

### Load sample data set and apply spaCy NER

In [2]:
path = "../data/uci-news-aggregator"
files = os.listdir(path)
df = pd.read_csv(os.path.join(path,files[1]),nrows=500)


# Apply spacy to the article titles
df["spacy_title"] = df["TITLE"].apply(lambda x : nlp(x))

# Add a column holding the named entities (NE) found in each title
df["named_entities"] = df["spacy_title"].apply(lambda x : x.ents)

df[['spacy_title','named_entities']][:3]

Unnamed: 0,spacy_title,named_entities
0,"(Apple, Unveils, Souped, Up, MacBook, Pro, wit...","((Apple), (MacBook, Pro, with, Retina))"
1,"(Delays, To, Intel, 's, Broadwell, CPUs, Bounc...","((Intel), (Broadwell))"
2,"(Apple, Cuts, Mac, Book, Pro, Price, by, Rs, 1...","((Apple), (11000), (India))"


##  Part-Of-Speech (POS) pattern recognition
A list of POS tags is available [here](https://sites.google.com/site/partofspeechhelp/home/nnp_nnps#TOC-Definition-of-NNPS-Proper-Noun-Plural-Form-1).

Next, we define a part-of-speech (grammar) pattern that describes the type of relation we want to extract from the data.

Suppose we are interested in finding an action relation between two named entities. We can express this as a pattern of part-of-speech tags:

Proper Noun - Verb - Proper Noun

In [3]:
pos_chain_1 = "NNP-VBZ-NNP"

Using spaCy, we can now iterate over the text and identify the relevant triplets (governor, relation, dependent), in other words the entities and the relations between them.

In [61]:
doc = nlp(df.loc[2,'TITLE'])
pos_chain = "-".join([d.tag_ for d in doc])
doc,pos_chain,pos_chain_1 


(Apple Cuts Mac Book Pro Price by Rs 11000 in India,
 'NNP-VBZ-NNP-NNP-NNP-NNP-IN-NNS-CD-IN-NNP',
 'NNP-VBZ-NNP')

In [62]:
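import re

def substr_index(string, pattern):
    """Return (start, end) character offsets of each occurrence of pattern in string."""
    # re.escape makes the pattern literal, so tags such as "PRP$" are not read as regex syntax
    return [(m.start(), m.end()) for m in re.finditer(re.escape(pattern), string)]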

substr_index(pos_chain,pos_chain_1)

[(0, 11)]
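
Note that `(0, 11)` is a pair of character offsets into the joined tag string, not token positions. A minimal sketch of matching on the list of tags instead, so that the result indexes tokens directly, could look like the following. The helper name `find_tag_spans` is hypothetical, and the tokens and tags are copied from the output above rather than produced by spaCy here.

```python
def find_tag_spans(tags, pattern):
    """Return (start, end) token index pairs where `pattern` occurs in `tags`.

    Both arguments are lists of POS tag strings; `end` is exclusive.
    """
    n = len(pattern)
    return [(i, i + n) for i in range(len(tags) - n + 1) if tags[i:i + n] == pattern]


# Tokens and tags for the title shown above (copied from the notebook output)
title_tokens = "Apple Cuts Mac Book Pro Price by Rs 11000 in India".split()
title_tags = "NNP-VBZ-NNP-NNP-NNP-NNP-IN-NNS-CD-IN-NNP".split("-")

spans = find_tag_spans(title_tags, ["NNP", "VBZ", "NNP"])
triples = [title_tokens[start:end] for start, end in spans]
print(spans)    # [(0, 3)]
print(triples)  # [['Apple', 'Cuts', 'Mac']]
```

Comparing whole tags also avoids a pitfall of substring matching: `"NNP-VBZ-NNP"` is a substring of `"NNP-VBZ-NNPS"`, but the list comparison does not treat `NNP` as a match for `NNPS`. With a spaCy `Doc`, the tag list would be `[t.tag_ for t in doc]`.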

In [64]:
doc.text

'Apple Cuts Mac Book Pro Price by Rs 11000 in India'

In [22]:
index_list = list()
for i, r in df.iterrows():
    pos_chain = "-".join([d.tag_ for d in r['spacy_title']])
    # Keep titles whose tag chain contains the pattern and that have at least two named entities
    if pos_chain_1 in pos_chain and len(r["named_entities"]) >= 2:
        index_list.append(i)
        print(r["TITLE"])
        print(pos_chain + '\n')

Apple Cuts Mac Book Pro Price by Rs 11000 in India
NNP-VBZ-NNP-NNP-NNP-NNP-IN-NNS-CD-IN-NNP

Apple gives MacBook Pro with Retina more power, more memory
NNP-VBZ-NNP-NNP-IN-NN-JJR-NN-,-JJR-NN

Apple refreshes MacBook Pro with Retina display lineup, drops prices
NNP-VBZ-NNP-NNP-IN-NN-NN-NN-,-VBZ-NNS

Apple refreshes MacBook Pro with Retina display line, adds mighty processor  ...
NNP-VBZ-NNP-NNP-IN-NN-NN-NN-,-VBZ-JJ-NN-_SP-.

Apple refreshes MacBook Pro with Retina display line, improved processors and  ...
NNP-VBZ-NNP-NNP-IN-NN-NN-NN-,-JJ-NNS-CC-_SP-.

Apple updates MacBook Pro Retina range with i7 processors
NNP-VBZ-NNP-JJ-NN-NN-IN-NN-NNS

Apple gives Retina MacBook Pros a speed boost ahead of Yosemite rollout
NNP-VBZ-NNP-NNP-NNS-DT-NN-NN-RB-IN-NNP-NN

Apple refreshes MacBook Pro with Retina display lineup with faster CPUs, more  ...
NNP-VBZ-NNP-NNP-IN-NN-NN-NN-IN-JJR-NNS-,-RBR-_SP-.

Apple updates Macbook Pro with Retina display line with more memory, faster  ...
NNP-VBZ-NNP-NNP-IN-NN

Matches of this kind expose entities and the relations between them, for example:

- Honda --- **restructures** ---> US operations  
- Carl Icahn --- **slams** ---> eBay CEO
- Google --- **confirms** ---> Android SDK 
- GM --- **hires** ---> Lehman Brothers 

Reference: https://kgtutorial.github.io/

## What about relations between Named Entities?
The problem with the above approach is that one needs a comprehensive list of possible *Part-Of-Speech* patterns defined a priori. In reality, nouns and verbs come in a wide variety of forms, with modifiers and so on.
For instance, you might also want to capture patterns that involve prepositions (IN), e.g. IN-VBZ, VBZ-IN, VBZ-IN-IN, VBN-IN, etc.
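
A handful of such variants can be folded into one regular expression over the joined tag chain. The sketch below is illustrative only: the pattern (proper noun, VBZ or VBN, an optional IN, proper noun) and the function name `matches_relation` are assumptions, not patterns taken from the data.

```python
import re

# Hypothetical pattern: proper noun, a VBZ or VBN verb, an optional
# preposition (IN), then another proper noun.
# The lookarounds stop "NNP" from matching inside a longer tag such as "NNPS".
relation_pattern = re.compile(r"(?<![A-Z])NNP-(VBZ|VBN)(-IN)?-NNP(?![A-Z])")

def matches_relation(pos_chain):
    """Return the matched tag sub-chain, or None if the chain has no match."""
    m = relation_pattern.search(pos_chain)
    return m.group(0) if m else None

print(matches_relation("NNP-VBZ-IN-NNP-NN"))  # NNP-VBZ-IN-NNP
print(matches_relation("NNP-VBZ-NNPS"))       # None
```

Every alternative still has to be enumerated by hand, so the list of patterns grows quickly.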

To overcome this you can:
 1. Constrain the type and number of relations you wish to find, and create patterns for those.
 2. Constrain the entities between which you wish to find relations, such as Person named entities.
 3. Train a probabilistic model to identify relation triplets, such as Stanford or OLLIE (see the Reddit link under References).
 
Below we will try to form relations between named entities using approach 2.

In [19]:
limit = 25
n = 0
for i, r in df.iterrows():
    ents = r["named_entities"]
    # Only consider titles with exactly two named entities
    if len(ents) != 2:
        continue
    words = r['spacy_title']
    pos_chain = "-".join([d.tag_ for d in words])

    # Look at the tokens between the two named entities
    for w in words[ents[0].end:ents[1].start]:
        # VBZ: verb, 3rd person singular present; VBN: verb, past participle
        if w.tag_ in ('VBZ', 'VBN'):
            n += 1
            print(words)
            print(pos_chain)
            print((ents[0], ents[0].label_),
                  (w, w.tag_),
                  (ents[1], ents[1].label_), '\n')

    # Use >= because a single title can add several matches at once
    if n >= limit:
        break

Apple Unveils Souped Up MacBook Pro with Retina Display
NNP-VBZ-VBN-RP-NN-NNP-IN-NN-NN
(Apple, 'ORG') (Unveils, 'VBZ') (MacBook Pro with Retina, 'PRODUCT') 

Apple Unveils Souped Up MacBook Pro with Retina Display
NNP-VBZ-VBN-RP-NN-NNP-IN-NN-NN
(Apple, 'ORG') (Souped, 'VBN') (MacBook Pro with Retina, 'PRODUCT') 

Apple unveils minor bumps to MacBook Pro laptops
NNP-VBZ-JJ-NNS-IN-NNP-JJ-NNS
(Apple, 'ORG') (unveils, 'VBZ') (MacBook Pro, 'PRODUCT') 

Retina MacBook Pro gets Faster Processors, More RAM
NNP-NNP-NNP-VBZ-JJR-NNS-,-JJR-NN
(MacBook Pro, 'PRODUCT') (gets, 'VBZ') (Faster Processors, 'ORG') 

The updated Retina MacBook Pro you've been waiting for could launch tomorrow
DT-VBN-NNP-NNP-NNP-PRP-VB-VBN-VBG-IN-MD-VB-NN
(Retina MacBook Pro, 'PRODUCT') (been, 'VBN') (tomorrow, 'DATE') 

Apple updates entire MacBook Pro line-up
NNP-VBZ-JJ-NNP-JJ-NN-HYPH-NN
(Apple, 'ORG') (updates, 'VBZ') (MacBook Pro, 'PRODUCT') 

Apple lifts lid on new Retina MacBook Pros, spec boost confirmed
NNP-VBZ-NN-

Using named entities appears to capture people and organisations more reliably. However, naively creating triplets by extracting the verbs between entities does not work well, because:
 - It fails on complex sentence structures.
 - It ignores other objects represented by nouns (proper nouns, common nouns, etc.).
 - Not all entity types are relevant (e.g. PERSON versus ORDINAL).
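
The last point can be addressed with a simple label filter. The sketch below works on plain `(text, label)` pairs standing in for spaCy's `ent.text` and `ent.label_`; the set of labels to keep and the name `keep_relevant` are assumptions chosen for illustration.

```python
# The labels treated as relevant here are an illustrative assumption.
RELEVANT_LABELS = {"PERSON", "ORG", "PRODUCT", "GPE"}

def keep_relevant(entities, labels=RELEVANT_LABELS):
    """Keep only (text, label) pairs whose label is in `labels`."""
    return [(text, label) for text, label in entities if label in labels]

# Entity pairs copied from the printed output above
found = [("Retina MacBook Pro", "PRODUCT"), ("tomorrow", "DATE")]
print(keep_relevant(found))  # [('Retina MacBook Pro', 'PRODUCT')]
```

Applied before pairing entities, a filter like this would drop the `DATE` entity that produced the `(been, 'VBN')` triplet above.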

We could improve on some of this by incorporating **[Noun Chunks](https://spacy.io/usage/linguistic-features#noun-chunks)**. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”.

    Text: The original noun chunk text.
    Root text: The original text of the word connecting the noun chunk to the rest of the parse.
    Root dep: Dependency relation connecting the root to its head.
    Root head text: The text of the root token’s head.
    Children: The immediate syntactic dependents of the root token.
    
 - spaCy uses the terms **head** and **child** to describe the words connected by a single arc in the dependency tree. 
 - The term **dep** is used for the arc label, which describes the type of syntactic relation that connects the child to the head.
 
We can extract further relations by examining the noun modifiers in the noun chunks.  
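
To make the head, child and dep terminology concrete, here is a minimal sketch that uses a hand-written parse instead of spaCy output. The `Tok` tuple, the example sentence and the helper names are hypothetical; in spaCy the corresponding attributes are `token.head`, `token.dep_` and `token.children`, and the `nsubj` and `dobj` arc labels are assumed here.

```python
from collections import namedtuple

# Hand-written stand-in for a dependency parse: each token records its text,
# the index of its head, and the label (dep) of the arc to that head.
Tok = namedtuple("Tok", ["text", "head", "dep"])

parse = [
    Tok("Apple", 1, "nsubj"),    # child of "updates"
    Tok("updates", 1, "ROOT"),   # the root points to itself, as in spaCy
    Tok("MacBook", 1, "dobj"),   # child of "updates"
]

def children(parse, head_index):
    """Return the tokens whose head is the token at `head_index`."""
    return [t for i, t in enumerate(parse) if t.head == head_index and i != head_index]

def subject_verb_object(parse):
    """Collect (subject, verb, object) triples from nsubj and dobj arcs."""
    triples = []
    for i, tok in enumerate(parse):
        kids = children(parse, i)
        subjects = [k.text for k in kids if k.dep == "nsubj"]
        objects = [k.text for k in kids if k.dep == "dobj"]
        triples += [(s, tok.text, o) for s in subjects for o in objects]
    return triples

print(subject_verb_object(parse))  # [('Apple', 'updates', 'MacBook')]
```

Reading relations off the arcs rather than off token order is what lets this approach cope with words that sit between the subject, the verb and the object.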

In [None]:
print(df['spacy_title'][0])

print([x for x in df['spacy_title'][0].noun_chunks])

In [None]:
words = nlp("""\
    Google is expanding its pool of machine learning talent with the purchase of a startup that specializes in 'instant' smartphone image recognition. \
    On Wednesday, French firm Moodstocks announced on its website that it's being acquired by Google, stating that it expects the deal to be completed in the next few weeks. \
    There's no word yet on how much Google is paying for the company. \
    Moodstocks' "on-device image recognition" software for smartphones will be phased out as it joins Google. \
    Moodstocks' team will also move over to Google's R&D center in Paris, according to Google's French blog. \
    "Ever since we started Moodstocks, our dream has been to give eyes to machines by \
    turning cameras into smart sensors able to make sense of their surroundings," Moodstocks said in a statement on its site.
    "Our focus will be to build great image recognition tools within Google, \
    but rest assured that current paying Moodstocks customers will be able to use it until the end of their subscription." 
    """)

words = nlp("Barack Obama was born in Hawaii.")

In [None]:
dat = list()
for chunk in df['spacy_title'][0].noun_chunks:
    dat.append(pd.DataFrame([chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text, [c for c in chunk.root.children]]).T)

# With jupyter=True, displacy.render displays the figure inline, so wrapping it in print() is not needed
displacy.render(df['spacy_title'][0], style='dep', jupyter=True, options={'distance': 110})
displacy.render(df['spacy_title'][0], style='ent', jupyter=True)

dat = pd.concat(dat)
dat.columns = ['Chunk', 'root.text', 'root.dep', 'root.head', 'root.child']
dat

In [None]:
# Take entities and tags from the same document; this assumes `words` has at least two entities
ents = words.ents
pos_chain = "-".join([d.tag_ for d in words])
n = 0
# Look at the tokens between the first two named entities
for w in words[ents[0].end:ents[1].start]:
    # VBZ: verb, 3rd person singular present; VBN: verb, past participle
    if w.tag_ in ('VBZ', 'VBN'):
        n += 1
        print(words)
        print(pos_chain)
        print((ents[0], ents[0].label_),
              (w, w.tag_),
              (ents[1], ents[1].label_), '\n')

Some other factors to consider:

 - Ownership: e.g. a noun or named entity followed by [NNS/VBZ](https://sites.google.com/site/partofspeechhelp/home/nns_vbz)
 - [KG and pruning](http://philipperemy.github.io/information-extract/)
  - [git](https://github.com/philipperemy/information-extraction-with-dominating-rules)
 
### References

 - [OLLIE](https://www.reddit.com/r/LanguageTechnology/comments/bovsf5/we_release_opiec_the_largest_open_information/)
 - [ClausIE](https://github.com/mmxgn/clausiepy)
 - [MinIE](https://github.com/mmxgn/miniepy/graphs/contributors)