## spaCy NER and Wikidata entity linking

We use spaCy to find and type named entities and to extract noun chunks, then try to link them to Wikidata items with the `wd_search` module.

In [1]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_md  #medium en pipeline

import wd_search as wds

### Load one of spaCy's language models. This is a medium-sized one for English

In [2]:
nlp = spacy.load("en_core_web_md")



### Input our text and run it through the spaCy pipeline

In [3]:
text = """The White House on Friday issued a statement condemning a series of "brutal" attacks in West Africa, 
 including the kidnapping of more than 100 schoolgirls and murder of aid workers in Nigeria. White House press 
 secretary Sarah Huckabee Sanders offered the Trump administration's "deepest sympathies to the families and 
 friends of those killed" and expressed resolve to hold violent extremists responsible.

"These attacks only strengthen the resolve of the United States and responsible nations to pursue, destroy, and 
 rid the world of those who commit such heinous acts," Sanders said.

The Trump administration's statement mentioned a terrorist attack Friday in Burkina Faso by armed Islamist militants, 
 which led to the deaths of at least eight members of local security forces. Eight militants were also reportedly 
 killed. Other attacks in the region, which have occurred over the past month, include the abduction of 110 schoolgirls 
 in Nigeria on Feb. 19, and Wednesday's attack that killed four United Nations peacekeepers in Mali. 

The militants' targets Friday in Burkina Faso included military headquarters and the French Embassy.
The State Department issued a travel advisory Friday, urging Americans to avoid traveling to the country. 

“Terrorist groups continue plotting attacks in Burkina Faso,” the State Department said. “Terrorists may conduct 
 attacks anywhere with little or no warning. Targets could include hotels, restaurants, police stations, customs 
 offices, military posts, and schools.” 
"""
doc = nlp(text)
print("done!")

done!


### Display the text, marking its entities and their types. The default types are the 18 types from [OntoNotes](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf)

In [4]:
displacy.render(doc, style="ent")

### Get the entity mentions and their types

In [5]:
spacy_entities = [(X.text, X.label_) for X in doc.ents]
print(spacy_entities)

[('The White House', 'ORG'), ('Friday', 'DATE'), ('West Africa', 'GPE'), ('more than 100', 'CARDINAL'), ('Nigeria', 'GPE'), ('White House', 'ORG'), ('Sarah Huckabee Sanders', 'PERSON'), ('Trump', 'PERSON'), ('the United States', 'GPE'), ('Sanders', 'PERSON'), ('Friday', 'DATE'), ('Burkina Faso', 'GPE'), ('Islamist', 'NORP'), ('at least eight', 'CARDINAL'), ('Eight', 'CARDINAL'), ('the past month', 'DATE'), ('110', 'CARDINAL'), ('Nigeria', 'GPE'), ('Feb. 19', 'DATE'), ('Wednesday', 'DATE'), ('four', 'CARDINAL'), ('United Nations', 'ORG'), ('Mali', 'GPE'), ('Friday', 'DATE'), ('Burkina Faso', 'GPE'), ('the French Embassy', 'ORG'), ('The State Department', 'ORG'), ('Friday', 'DATE'), ('Americans', 'NORP'), ('Burkina Faso', 'GPE'), ('the State Department', 'ORG'), ('Terrorists', 'ORG')]
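Since `Counter` is already imported, a quick tally of labels shows which entity types dominate the text. The following is a minimal sketch that uses a hand-copied subset of the output above; `sample_entities` and `count_labels` are illustrative names, not part of the notebook.

```python
from collections import Counter

# A few (mention, label) pairs copied from the output above
sample_entities = [
    ("The White House", "ORG"),
    ("Friday", "DATE"),
    ("West Africa", "GPE"),
    ("more than 100", "CARDINAL"),
    ("Nigeria", "GPE"),
    ("Sarah Huckabee Sanders", "PERSON"),
    ("Trump", "PERSON"),
    ("Burkina Faso", "GPE"),
]

def count_labels(entities):
    """Count how often each entity label occurs."""
    return Counter(label for _, label in entities)

label_counts = count_labels(sample_entities)
for label, n in label_counts.most_common():
    print(f"{label:10s} {n}")
```

In the notebook itself, the same function could be called on `spacy_entities` directly.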


### We'll use a simple link function again

In [6]:
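def link(string, ent_type):
    # Return only the top hit for the mention, or an empty dict if nothing matches.
    # The parameter is named ent_type to avoid shadowing the built-in type().
    result = wds.wd_scale_search(string, target_types=[ent_type], dbpedia=0, top=1)
    return result[0] if result else {}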

### Try to link them using their spaCy-recognized types

In [9]:
wd_entities = [link(ent[0], ent[1]) for ent in spacy_entities]
print('done!')

done!


In [12]:
for se, wde in zip(spacy_entities, wd_entities):
    print(f"{se} => {wds.summary(wde)}")

('The White House', 'ORG') => ('Q35525', 'White House', 'official residence and workplace of the President of the United States.', 'https://wikidata.org/wiki/Q35525')
('Friday', 'DATE') => ('Q130', 'Friday', 'day of the week', 'https://wikidata.org/wiki/Q130')
('West Africa', 'GPE') => ('Q953068', 'South-West Africa', 'former country, a mandate of South Africa', 'https://wikidata.org/wiki/Q953068')
('more than 100', 'CARDINAL') => ('', '', '', '')
('Nigeria', 'GPE') => ('Q1033', 'Nigeria', 'sovereign state in West Africa', 'https://wikidata.org/wiki/Q1033')
('White House', 'ORG') => ('Q35525', 'White House', 'official residence and workplace of the President of the United States.', 'https://wikidata.org/wiki/Q35525')
('Sarah Huckabee Sanders', 'PERSON') => ('Q27986907', 'Sarah Sanders', 'American political press secretary', 'https://wikidata.org/wiki/Q27986907')
('Trump', 'PERSON') => ('Q22686', 'Donald Trump', '45th president of the United States', 'https://wikidata.org/wiki/Q22686')
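The output above shows that a numeric mention such as `'more than 100'` returns an empty result, and several mentions (`'Friday'`, `'Nigeria'`, `'Burkina Faso'`) occur more than once in the entity list. One possible refinement, sketched below, is to skip numeric labels and link each distinct mention only once. The names `SKIP_LABELS`, `link_unique`, and `fake_link` are hypothetical; `fake_link` is a stand-in for the notebook's `link()` so the example runs without the `wd_search` module. The choice of labels to skip is an assumption, not something the notebook prescribes.

```python
# Numeric OntoNotes labels that rarely map to a single Wikidata item
# (an assumption; adjust to taste)
SKIP_LABELS = {"CARDINAL", "ORDINAL", "QUANTITY", "PERCENT", "MONEY"}

def link_unique(entities, link_fn, skip_labels=SKIP_LABELS):
    """Link each distinct (mention, label) pair once, skipping unlinkable labels."""
    cache = {}
    for mention, label in entities:
        key = (mention, label)
        if label in skip_labels or key in cache:
            continue
        cache[key] = link_fn(mention, label)
    return cache

# Stand-in for the notebook's link(); it only records which calls were made
calls = []

def fake_link(mention, label):
    calls.append((mention, label))
    return {"mention": mention}

entities = [
    ("Friday", "DATE"),
    ("Nigeria", "GPE"),
    ("more than 100", "CARDINAL"),
    ("Nigeria", "GPE"),
]
linked = link_unique(entities, fake_link)
print(linked)
```

With the real linker, the call would be `link_unique(spacy_entities, link)`, which avoids repeated searches for the same mention.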


### Noun chunks might correspond to nominal entity mentions or concept mentions
However, we will have to remove the named entities, drop some of the remaining chunks, and trim others. Coreference resolution will also be helpful.

In [14]:
noun_chunks = [(X.text, X.label_) for X in doc.noun_chunks]
print(noun_chunks)
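The filtering mentioned above could take many forms. The sketch below shows one crude version, assuming each chunk is given as `(text, start_char, end_char)` and each entity as `(start_char, end_char)`: it drops any chunk that overlaps a named entity and trims a leading determiner from the rest. `DETERMINERS` and `filter_chunks` are hypothetical names, and dropping every overlapping chunk is a simplification (it also discards chunks that merely contain an entity).

```python
# Leading words to trim from a chunk (a small, illustrative list)
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}

def filter_chunks(chunks, ent_spans):
    """Drop chunks that overlap an entity span and trim leading determiners.

    chunks:    list of (text, start_char, end_char)
    ent_spans: list of (start_char, end_char)
    """
    kept = []
    for text, start, end in chunks:
        # Two half-open ranges overlap if each starts before the other ends
        if any(start < e_end and e_start < end for e_start, e_end in ent_spans):
            continue
        words = text.split()
        if words and words[0].lower() in DETERMINERS:
            words = words[1:]
        if words:  # skip chunks that were only a determiner, e.g. "those"
            kept.append(" ".join(words))
    return kept

# Offsets refer to: "The White House issued a statement condemning a series"
chunks = [("The White House", 0, 15), ("a statement", 23, 34), ("a series", 46, 54)]
ent_spans = [(0, 15)]  # "The White House" was tagged as an entity
filtered = filter_chunks(chunks, ent_spans)
print(filtered)
```

In the notebook, the inputs could be built from the parsed document with `[(c.text, c.start_char, c.end_char) for c in doc.noun_chunks]` and `[(e.start_char, e.end_char) for e in doc.ents]`.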



fin