## [spaCy : Faster Natural Language Processing Toolkit](https://www.kaggle.com/shivamb/spacy-text-meta-features-knowledge-graphs)
 
- Building a Basic Knowledge Graph using spaCy

In [1]:
import pandas as pd
import spacy 

# Load the en_core_web_lg model
nlp = spacy.load("en_core_web_lg")

## Basics of Knowledge Graphs using spaCy

In this section, I explain the basics of building knowledge graphs using spaCy. 
First, let's understand what knowledge graphs are. 

- What are knowledge graphs? 
> Knowledge stored in graph form. The knowledge is captured as entities, attributes, and relationships: nodes represent entities, node labels represent attributes, and edges represent relationships. 

- Example:  
> Chris Nolan (Director, Producer, Person) ---> born in ---> London (Place) ---> director of ---> Interstellar (Movie) ---> shot in ---> Iceland (Place)  

- Sources of information for building knowledge graphs: 
> Structured Text: Wikipedia, DBpedia  
> Unstructured Text: Social Media, Blogs, Images, Videos, Audio 
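
As a minimal sketch of the example above, the graph can be held in plain Python as a list of (head, relation, tail) triples plus a dictionary of node labels. The names `node_labels`, `triples`, `outgoing`, and `neighbors` are illustrative and not part of spaCy. The chain in the example is read here, as an assumption, as three separate facts: Chris Nolan was born in London, Chris Nolan directed Interstellar, and Interstellar was shot in Iceland.

```python
from collections import defaultdict

# Node labels (attributes) for each entity in the example.
node_labels = {
    "Chris Nolan": {"Director", "Producer", "Person"},
    "London": {"Place"},
    "Interstellar": {"Movie"},
    "Iceland": {"Place"},
}

# Edges stored as (head, relation, tail) triples.
triples = [
    ("Chris Nolan", "born in", "London"),
    ("Chris Nolan", "director of", "Interstellar"),
    ("Interstellar", "shot in", "Iceland"),
]

# Index the edges by head node so outgoing relations are easy to look up.
outgoing = defaultdict(list)
for head, relation, tail in triples:
    outgoing[head].append((relation, tail))


def neighbors(node):
    """Return the (relation, tail) pairs that leave the given node."""
    return outgoing.get(node, [])


print(neighbors("Chris Nolan"))
```

A dedicated graph library or database would usually replace these dictionaries once the graph grows, but the triple form stays the same.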

#### Main ideas for building knowledge graphs

- Entity Extraction   
In this step, the aim is to extract the right entities from the text data. spaCy provides NER (Named Entity Recognition), which can be used for this purpose.  

- Relationship Extraction    
In this step, the aim is to identify the relationships between entities. Again, spaCy can be used to extract the grammatical relations between two words or entities.  

- Relationship Linking    
The hard part of knowledge graphs is identifying what kind of relationship exists between two entities. The idea is to add contextual sense to the relationship. 
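
One simple way to picture relationship linking is as a mapping from the surface verb found between two entities to a canonical relation type. The sketch below is only an illustration: the `CANONICAL_RELATIONS` table, its labels, and the `link_relation` helper are hypothetical, and a real system would typically need more context than a lookup table.

```python
# Hypothetical lookup from surface verbs to canonical relation labels.
CANONICAL_RELATIONS = {
    "hires": "EMPLOYS",
    "appoints": "EMPLOYS",
    "slams": "CRITICIZES",
    "criticizes": "CRITICIZES",
    "confirms": "ANNOUNCES",
}


def link_relation(head, verb, tail):
    """Replace the surface verb with a canonical label, if one is known."""
    label = CANONICAL_RELATIONS.get(verb.lower(), "UNKNOWN")
    return (head, label, tail)


print(link_relation("GM", "hires", "Lehman Brothers"))
```

Verbs missing from the table fall back to `"UNKNOWN"`, which makes the gaps in the mapping easy to spot.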

Let's look at a very high-level implementation of this idea using spaCy. Let's load a news dataset. 

In [2]:
import os

In [19]:
path = "../data/uci-news-aggregator"
files = os.listdir(path)
df = pd.read_csv(os.path.join(path,files[0]),nrows=5000)
df[:3]

| | ID | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
|---|---|---|---|---|---|---|---|---|
| 0 | 300000 | World's rarest stamp smashes sale records | http://www.thetimes.co.uk/tto/news/world/ameri... | The Times (subscription) | e | dv5GXdcteE5TwrMUj7DqIO5xDrWlM | www.thetimes.co.uk | 1403086630008 |
| 1 | 300001 | Murderer's Estate Sells Stamp for Record $9.5 ... | http://www.newsmax.com/US/du-Pont-Guyana-stamp... | Newsmax.com | e | dv5GXdcteE5TwrMUj7DqIO5xDrWlM | www.newsmax.com | 1403087405702 |
| 2 | 300002 | Stamp sells for record US$9.5m in New York | http://www.businesstimes.com.sg/breaking-news/... | THE BUSINESS TIMES (subscription) | e | dv5GXdcteE5TwrMUj7DqIO5xDrWlM | www.businesstimes.com.sg | 1403087406095 |


In [31]:
# Apply spacy to the article titles
df["spacy_title"] = df["TITLE"].apply(lambda x : nlp(x))

# Add a column holding the named entities found in each title
df["named_entities"] = df["spacy_title"].apply(lambda x : x.ents)

df[['spacy_title','named_entities']][:3]

| | spacy_title | named_entities |
|---|---|---|
| 0 | (World, 's, rarest, stamp, smashes, sale, reco... | () |
| 1 | (Murderer, 's, Estate, Sells, Stamp, for, Reco... | ((Murderer, 's), ($, 9.5, Million)) |
| 2 | (Stamp, sells, for, record, US$, 9.5, m, in, N... | ((US$, 9.5, m), (New, York)) |


## POS Pattern Recognition
#### [here is a list of POS](https://sites.google.com/site/partofspeechhelp/home/nnp_nnps#TOC-Definition-of-NNPS-Proper-Noun-Plural-Form-1)
Now, we will define a grammar pattern / part of speech pattern to identify what type of relations we want to extract from the data. 

Let's say we are interested in finding an action relation between two named entities. We can define a pattern using part-of-speech tags as: 

Proper Noun - Verb - Proper Noun

In [32]:
pos_chain_1 = "NNP-VBZ-NNP"

Using spaCy, we can now iterate over the text and identify the relevant triplets (governor, relation, dependent), or in other terms, the entities and relations.

In [38]:
index_list = list()
for i, r in df.iterrows():
    pos_chain = "-".join([d.tag_ for d in r['spacy_title']])
    if pos_chain_1 in pos_chain:
        if len(r["named_entities"]) == 2:
            index_list.append(i)
#             print (r["TITLE"])
#             print (r["named_entities"])
#             print (pos_chain)
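
One caveat with joining tags into a string: `pos_chain_1 in pos_chain` is a substring test, so `"NNP-VBZ-NNP"` also matches a chain such as `NNP-VBZ-NNPS`. Where that distinction matters, the pattern can be compared tag by tag instead. The sketch below assumes the tags are already available as a list of strings; `find_tag_pattern` is an illustrative helper, not a spaCy function.

```python
def find_tag_pattern(tags, pattern):
    """Return the start index of every exact, tag-by-tag match of pattern."""
    width = len(pattern)
    return [
        start
        for start in range(len(tags) - width + 1)
        if tags[start:start + width] == pattern
    ]


pattern = ["NNP", "VBZ", "NNP"]

# Exact match: the pattern starts at token index 2.
print(find_tag_pattern(["NNP", "NNP", "NNP", "VBZ", "NNP", "NNPS", "NNS"], pattern))

# No tag-by-tag match, although the joined string contains "NNP-VBZ-NNP".
print(find_tag_pattern(["NNP", "VBZ", "NNPS"], pattern))
```

Returning the start index also tells you where the verb sits in the title, which the plain substring test does not.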

In [98]:
words = df.loc[index_list[90]]['spacy_title']
print(words,'\n')
for ent in words.ents:
    print('Entity span "{}" [{}:{}]'.format(ent,ent.start,ent.end))

US Patent Office cancels Washington Redskins trademarks 

Entity span "US Patent Office" [0:3]
Entity span "Washington Redskins" [4:6]


In [99]:
[w.tag_ for w in df.loc[index_list[90]]['spacy_title']]

['NNP', 'NNP', 'NNP', 'VBZ', 'NNP', 'NNPS', 'NNS']

In [101]:
dir(words[0])

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_extension',
 'has_vector',
 'head',
 'i',
 'idx',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'left_edge',
 'lefts',
 'lemma',
 'lemma_',
 'lex_id',
 'like_email',
 'like_num',
 'l

From these examples, one can see different entities and relations, for example: 

- Honda --- **restructures** ---> US operations  
- Carl Icahn --- **slams** ---> eBay CEO
- Google --- **confirms** ---> Android SDK 
- GM --- **hires** ---> Lehman Brothers 
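
Triplets like these can be read off a parsed title by taking the two entity spans and the verb between them. The sketch below works on plain lists so that it runs without a model; it reuses the tokens, tags, and entity offsets printed earlier for the US Patent Office title. `extract_triplet` is an illustrative helper rather than a spaCy API, and the rule of requiring exactly one `VBZ` token between the spans is an assumption made for this example.

```python
def extract_triplet(tokens, tags, entity_spans):
    """Build (governor, relation, dependent) from two entity spans.

    entity_spans holds (start, end) token offsets, end exclusive,
    in the same form as spaCy's ent.start and ent.end.
    Returns None unless exactly one VBZ token sits between the spans.
    """
    if len(entity_spans) != 2:
        return None
    (first_start, first_end), (second_start, second_end) = sorted(entity_spans)
    verbs = [
        tokens[i]
        for i in range(first_end, second_start)
        if tags[i] == "VBZ"
    ]
    if len(verbs) != 1:
        return None
    governor = " ".join(tokens[first_start:first_end])
    dependent = " ".join(tokens[second_start:second_end])
    return (governor, verbs[0], dependent)


tokens = ["US", "Patent", "Office", "cancels", "Washington", "Redskins", "trademarks"]
tags = ["NNP", "NNP", "NNP", "VBZ", "NNP", "NNPS", "NNS"]
spans = [(0, 3), (4, 6)]

print(extract_triplet(tokens, tags, spans))
```

With a real `Doc`, the same inputs would come from `[t.text for t in doc]`, `[t.tag_ for t in doc]`, and `[(e.start, e.end) for e in doc.ents]`.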

References: https://kgtutorial.github.io/