## [spaCy : Faster Natural Language Processing Toolkit](https://www.kaggle.com/shivamb/spacy-text-meta-features-knowledge-graphs)
 
- Building a Basic Knowledge Graph using spaCy

In [229]:
import pandas as pd
import spacy 
from spacy import displacy

# Load the en_core_web_lg model
nlp = spacy.load("en_core_web_lg")

## Basics of Knowledge Graphs using spaCy

This section explains the basics of building knowledge graphs using spaCy. 
First, let's understand what knowledge graphs are. 

- What are knowledge graphs? 
> Knowledge stored in graph form. The knowledge is captured in entities, attributes, and relationships: nodes represent entities, node labels represent attributes, and edges represent relationships. 

- Example:  
> Chris Nolan (Director, Producer, person) ---> born in  ----> London (place) ---> Director of  ----> Interstellar (Movie) ---> shot in  -----> Iceland (place)  

- Source of information for building knowledge graphs: 
> Structured Text: Wikipedia, DBpedia  
> Unstructured Text: Social Media, Blogs, Images, Videos, Audio 
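
To make the node / edge vocabulary concrete, here is a minimal sketch that stores the Chris Nolan example as (subject, relation, object) triples in plain Python. The names `node_labels`, `edges`, `graph` and `neighbours` are illustrative and not part of spaCy, and reading the chain as three separate facts is an interpretation of the example.

```python
from collections import defaultdict

# Node labels (attributes) taken from the example above.
node_labels = {
    "Chris Nolan": ["Director", "Producer", "person"],
    "London": ["place"],
    "Interstellar": ["Movie"],
    "Iceland": ["place"],
}

# Edges as (subject, relation, object) triples. Splitting the chain into
# three facts is an interpretation, not something spaCy produced.
edges = [
    ("Chris Nolan", "born in", "London"),
    ("Chris Nolan", "director of", "Interstellar"),
    ("Interstellar", "shot in", "Iceland"),
]

# Adjacency list: subject -> [(relation, object), ...]
graph = defaultdict(list)
for subject, relation, obj in edges:
    graph[subject].append((relation, obj))


def neighbours(node):
    """Return the outgoing (relation, object) pairs of a node."""
    return graph.get(node, [])


for relation, obj in neighbours("Chris Nolan"):
    print(f"Chris Nolan --- {relation} ---> {obj} {node_labels[obj]}")
```

A list of triples is usually enough for small experiments; a graph library or a graph database only becomes useful once the graph needs to be queried or traversed at scale.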

#### Main ideas for building knowledge graphs

- Entity Extraction   
In this step, the aim is to extract the right entities from the text data. spaCy provides NER (Named Entity Recognition), which can be used for this purpose.  

- Relationship Extraction    
In this step, the aim is to identify the relationships between entities (or sentences). Again, spaCy can be used to extract the grammatical relations between two words / entities.  

- Relationship Linking    
The hard part of building knowledge graphs is identifying what kind of relationship exists between two entities. The idea is to add contextual sense to the relationship. 
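
One very simple way to picture relationship linking is a lookup from surface verbs to a small set of canonical relation types. The sketch below is only an illustration: the `RELATION_TYPES` table, the relation names and the `link_relation` helper are hypothetical, and the sample triples are written by hand rather than extracted by spaCy.

```python
# Hypothetical lookup from surface verbs to canonical relation types.
# Both the verbs and the relation names are illustrative assumptions.
RELATION_TYPES = {
    "hires": "EMPLOYS",
    "employs": "EMPLOYS",
    "acquires": "OWNS",
    "buys": "OWNS",
    "launches": "CREATES",
    "founds": "CREATES",
}


def link_relation(triple, default="RELATED_TO"):
    """Replace the raw verb in (subject, verb, object) with a relation type."""
    subject, verb, obj = triple
    return (subject, RELATION_TYPES.get(verb.lower(), default), obj)


# Hand-written sample triples.
raw_triples = [
    ("GM", "hires", "Lehman Brothers"),
    ("Katy Perry", "launches", "record label"),
    ("Carl Icahn", "slams", "eBay CEO"),
]

linked = [link_relation(t) for t in raw_triples]
for t in linked:
    print(t)
```

Verbs missing from the table fall back to a generic `RELATED_TO` label. A more realistic version would probably key the table on verb lemmas rather than inflected forms, and would use the surrounding context to pick between candidate relation types.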

Let's look at a very high-level implementation of this idea using spaCy. Let's load a news dataset. 

In [2]:
import os

In [19]:
path = "../data/uci-news-aggregator"
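# os.listdir returns entries in arbitrary order, so files[0] is only predictable
# when the folder holds a single CSV file
files = os.listdir(path)
df = pd.read_csv(os.path.join(path, files[0]), nrows=5000)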
df[:3]

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,300000,World's rarest stamp smashes sale records,http://www.thetimes.co.uk/tto/news/world/ameri...,The Times (subscription),e,dv5GXdcteE5TwrMUj7DqIO5xDrWlM,www.thetimes.co.uk,1403086630008
1,300001,Murderer's Estate Sells Stamp for Record $9.5 ...,http://www.newsmax.com/US/du-Pont-Guyana-stamp...,Newsmax.com,e,dv5GXdcteE5TwrMUj7DqIO5xDrWlM,www.newsmax.com,1403087405702
2,300002,Stamp sells for record US$9.5m in New York,http://www.businesstimes.com.sg/breaking-news/...,THE BUSINESS TIMES (subscription),e,dv5GXdcteE5TwrMUj7DqIO5xDrWlM,www.businesstimes.com.sg,1403087406095


In [31]:
# Apply spacy to the article titles
df["spacy_title"] = df["TITLE"].apply(lambda x : nlp(x))

# add field of NE
df["named_entities"] = df["spacy_title"].apply(lambda x : x.ents)

df[['spacy_title','named_entities']][:3]

Unnamed: 0,spacy_title,named_entities
0,"(World, 's, rarest, stamp, smashes, sale, reco...",()
1,"(Murderer, 's, Estate, Sells, Stamp, for, Reco...","((Murderer, 's), ($, 9.5, Million))"
2,"(Stamp, sells, for, record, US$, 9.5, m, in, N...","((US$, 9.5, m), (New, York))"


## IE-Relations using POS Pattern Recognition
#### [Here is a list of POS tags](https://sites.google.com/site/partofspeechhelp/home/nnp_nnps#TOC-Definition-of-NNPS-Proper-Noun-Plural-Form-1)
Now, we will define a grammar pattern / part of speech pattern to identify what type of relations we want to extract from the data. 

Let's say we are interested in finding an action relation between two named entities. We can define a pattern using part-of-speech tags as: 

Proper Noun - Verb - Proper Noun

In [32]:
pos_chain_1 = "NNP-VBZ-NNP"

Using spaCy, we can now iterate over the text and identify the relevant triplets (governor, relation, dependent), or in other terms, the entities and relations.

In [None]:
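index_list = list()
for i, r in df.iterrows():
    # Join the fine-grained tags of the title into one string, e.g. "NNP-VBZ-NNP-NN"
    pos_chain = "-".join([d.tag_ for d in r['spacy_title']])
    # Keep titles that contain the pattern and have at least two named entities
    if pos_chain_1 in pos_chain:
        if len(r["named_entities"]) >= 2:
            index_list.append(i)
            print(r["TITLE"])
            print(r["named_entities"])
            print(pos_chain)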

From these examples, one can see different entities and relations, for example: 

- Honda --- **restructures** ---> US operations  
- Carl Icahn --- **slams** ---> eBay CEO
- Google --- **confirms** ---> Android SDK 
- GM --- **hires** ---> Lehman Brothers 
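
The loop above tests the pattern as a substring of the hyphen-joined tag string, so `NNP-VBZ-NNP` also matches a title whose tags contain `NNP-VBZ-NNPS`, and it does not report where the match sits. Below is a minimal sketch of a token-wise alternative, assuming the tags are available as a list such as `[t.tag_ for t in doc]`. The helper name `find_tag_pattern` and the sample tag lists are made up for illustration.

```python
def find_tag_pattern(tags, pattern):
    """Return start indices where `pattern` occurs as a contiguous run in `tags`."""
    n = len(pattern)
    return [i for i in range(len(tags) - n + 1) if tags[i:i + n] == pattern]


pattern = ["NNP", "VBZ", "NNP"]

# Hand-written tag sequences standing in for [t.tag_ for t in doc].
exact = ["NNP", "VBZ", "NNP", "NNP"]
plural = ["NNP", "VBZ", "NNPS"]

print(find_tag_pattern(exact, pattern))   # token-wise match on the first list
print(find_tag_pattern(plural, pattern))  # no token-wise match on the second

# The substring test on the joined string accepts the second list as well.
print("-".join(pattern) in "-".join(plural))
```

The returned indices can be used to slice the matching tokens out of the title, which the substring test cannot do directly.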

References : https://kgtutorial.github.io/


# IE relations using NER

**You would also want patterns that include prepositions (IN), e.g. IN-VBZ, VBZ-IN, VBZ-IN-IN, VBN-IN, etc.**

In [261]:
limit = 4
n = 0
for i, r in df.iterrows():
    # Only consider titles with exactly two named entities
    if len(r["named_entities"]) == 2:
        ents = r["named_entities"]
        words = r['spacy_title']
        pos_chain = "-".join([d.tag_ for d in words])
        # Look for a verb (VBZ or VBN) between the two entities
        for w in words[ents[0].end:ents[1].start]:
            if w.tag_ in ('VBZ', 'VBN'):
                n += 1
                print(words)
                print(pos_chain)
                print((ents[0], ents[0].label_),
                      (w, w.tag_),
                      (ents[1], ents[1].label_), '\n')

        # Use >= because one title can add several matches and step past the limit
        if n >= limit:
            break

Killer's rare stamp fetches $9 million
NNP-POS-JJ-NN-VBZ-$-CD-CD
(Killer, 'ORG') (fetches, 'VBZ') ($9 million, 'MONEY') 

Singer Katy Perry launches record label through Capitol
NNP-NNP-NNP-VBZ-NN-NN-IN-NNP
(Katy Perry, 'PERSON') (launches, 'VBZ') (Capitol, 'ORG') 

Katy Perry launches own record label, reveals first signee
NNP-NNP-VBZ-JJ-NN-NN-,-VBZ-JJ-NN
(Katy Perry, 'PERSON') (launches, 'VBZ') (first, 'ORDINAL') 

Katy Perry launches own record label, reveals first signee
NNP-NNP-VBZ-JJ-NN-NN-,-VBZ-JJ-NN
(Katy Perry, 'PERSON') (reveals, 'VBZ') (first, 'ORDINAL') 



Naively creating triplets by extracting the verbs between entities does not work well because:
 - It fails on complex sentence structures. 
 - It ignores other objects represented by nouns, proper nouns, common nouns, etc. 
 - Not all entity type pairs are relevant, e.g. PERSON:ORDINAL.
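
The last limitation can be softened by keeping only triplets whose entity label pair looks meaningful. The sketch below works on the triplets printed above, copied in by hand. The `ALLOWED_PAIRS` whitelist and the `keep` helper are illustrative assumptions, not a spaCy feature.

```python
# Label pairs treated as meaningful; this whitelist is an illustrative assumption.
ALLOWED_PAIRS = {
    ("PERSON", "ORG"),
    ("ORG", "ORG"),
    ("ORG", "MONEY"),
    ("PERSON", "GPE"),
}

# Triplets copied from the output above: ((text, label), verb, (text, label)).
triples = [
    (("Killer", "ORG"), "fetches", ("$9 million", "MONEY")),
    (("Katy Perry", "PERSON"), "launches", ("Capitol", "ORG")),
    (("Katy Perry", "PERSON"), "launches", ("first", "ORDINAL")),
    (("Katy Perry", "PERSON"), "reveals", ("first", "ORDINAL")),
]


def keep(triple):
    """True if the (head label, tail label) pair is on the whitelist."""
    (_, head_label), _, (_, tail_label) = triple
    return (head_label, tail_label) in ALLOWED_PAIRS


filtered = [t for t in triples if keep(t)]
for t in filtered:
    print(t)
```

The filter drops both PERSON:ORDINAL triplets, but it cannot repair a wrong label: "Killer" is still kept as an ORG because the pair ORG:MONEY is on the whitelist.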

### Noun Chunks
In addition to the above, we might consider **[Noun Chunks](https://spacy.io/usage/linguistic-features#noun-chunks)**. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”.

    Text: The original noun chunk text.
    Root text: The original text of the word connecting the noun chunk to the rest of the parse.
    Root dep: Dependency relation connecting the root to its head.
    Root head text: The text of the root token’s head.
    Children: The immediate syntactic dependents of the root token.
    
 - spaCy uses the terms **head** and **child** to describe the words connected by a single arc in the dependency tree. 
 - The term **dep** is used for the arc label, which describes the type of syntactic relation that connects the child to the head.

In [255]:
# `words` is the spaCy Doc of a single title (set in the loop of the previous section)
dat = list()
for chunk in words.noun_chunks:
    dat.append(pd.DataFrame([chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text, list(chunk.root.children)]).T)

# With jupyter=True, displacy.render displays the markup inline and returns None,
# so wrapping it in print() only prints "None"
displacy.render(words, style='dep', jupyter=True, options={'distance': 110})
displacy.render(words, style='ent', jupyter=True, options={'distance': 110})

dat = pd.concat(dat)
dat.columns = ['Chunk', 'root.text', 'root.dep', 'root.head', 'root.child']
dat

Unnamed: 0,Chunk,root.text,root.dep,root.head,root.child
0,Kanye West,West,nsubj,opens,[Kanye]
0,the struggles,struggles,pobj,about,"[the, come]"
0,Kim Kardashian,Kardashian,dobj,dating,[Kim]
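
The rows of this table can be grouped by their head word to see whether any single head governs both a subject and an object, which is what a (subject, verb, object) triplet would need. This is a minimal sketch using the three rows above, copied in by hand. Grouping by head text instead of by token is a simplifying assumption, and the names `chunks`, `by_head` and `triples` are illustrative.

```python
from collections import defaultdict

# Rows copied from the table above: (chunk, root dep, root head).
chunks = [
    ("Kanye West", "nsubj", "opens"),
    ("the struggles", "pobj", "about"),
    ("Kim Kardashian", "dobj", "dating"),
]

# head word -> {dependency label: chunk text}
by_head = defaultdict(dict)
for chunk, dep, head in chunks:
    by_head[head][dep] = chunk

# A head yields a (subject, verb, object) triplet only if it governs both roles.
triples = [
    (roles["nsubj"], head, roles["dobj"])
    for head, roles in by_head.items()
    if "nsubj" in roles and "dobj" in roles
]

print(dict(by_head))
print(triples)
```

For this title the subject attaches to `opens` and the object to `dating`, so no head governs both and the list stays empty. Linking "Kanye West" to "Kim Kardashian" would require walking the dependency tree between the two heads rather than pairing noun chunks directly.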


# Ownership

Named entity followed by: [NNS/VBZ](https://sites.google.com/site/partofspeechhelp/home/nns_vbz)