[**Blueprints for Text Analysis Using Python**](https://github.com/blueprints-for-text-analytics-python/blueprints-text)  
Jens Albrecht, Sidharth Ramachandran, Christian Winkler

**If you like the book or the code examples here, please leave a friendly comment on [Amazon.com](https://www.amazon.com/Blueprints-Text-Analytics-Using-Python/dp/149207408X)!**
<img src="../rating.png" width="100"/>


# Chapter 12:<div class='tocSkip'/>

# Building a Knowledge Graph

## Remark<div class='tocSkip'/>

The code in this notebook differs slightly from the printed book. For example, we frequently use pretty print (`pp.pprint`) instead of `print` and `tqdm`'s `progress_apply` instead of Pandas' `apply`. 

Moreover, several layout and formatting commands, like `figsize` to control figure size or subplot commands, are removed in the book.

You may also find some lines marked with three hashes (###). Those are not in the book either, as they don't contribute to the concept.

All of this is done to simplify the code in the book and put the focus on the important parts instead of formatting.

## Setup<div class='tocSkip'/>

Set directory locations. If working on Google Colab: copy files and install required libraries.

**On Colab:** Use runtime **with GPU (Menu&rarr;Runtime&rarr;Change runtime type)** for better performance **before** you start this notebook.

In [None]:
import sys, os
ON_COLAB = 'google.colab' in sys.modules

if ON_COLAB:
    GIT_ROOT = 'https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master'
    os.system(f'wget {GIT_ROOT}/ch12/setup.py')

%run -i setup.py

## Load Python Settings<div class="tocSkip"/>

Common imports, defaults for formatting in Matplotlib, Pandas etc.

In [None]:
%run "$BASE_DIR/settings.py"

%reload_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'png'

# to print output of all statements and not just the last
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
sys.path.append(BASE_DIR + '/packages') # to import blueprints package

from blueprints.knowledge import display_ner, reset_pipeline, print_dep_tree, alias_lookup

## What you'll learn and what we build


# Knowledge Graphs


## Blueprint: Querying Wikidata for Aliases (not in book)

Below is an example of what you can do with public ontologies like Wikidata. Here, we define a SPARQL query to retrieve the names, aliases, and URLs of all entities of type "United States federal executive department" (https://www.wikidata.org/wiki/Q910252).

In [None]:
# pip install sparqlwrapper
# https://rdflib.github.io/sparqlwrapper/

import sys
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url = "https://query.wikidata.org/sparql"

query = """
SELECT ?org ?orgLabel ?aliases ?urlLabel ?country ?countryLabel WITH {
  SELECT ?org (group_concat(distinct ?alias;separator=",") as ?aliases)
  WHERE {
    ?org wdt:P31 wd:Q910252. # org is(P31) US department (Q910252)
    ?org skos:altLabel ?alias. filter(lang(?alias)="en")
  } GROUP BY ?org } AS %i
  WHERE {
  include %i
  ?org wdt:P856 ?url; # has official website (P856)
       wdt:P17 ?country. # has country (P17)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY ?orgLabel
"""

def sparql_df(endpoint_url, query):
    user_agent = "Wikidata-Service Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    columns = results['head']['vars']
    rows = []
    for result in results["results"]["bindings"]:
        row = {}
        for col in result:
            row[col] = result[col]['value']
        rows.append(row)

    return pd.DataFrame.from_records(rows, columns=columns)

wd_df = sparql_df(endpoint_url, query)

# rename columns
wd_df.columns = ['org_id', 'org', 'aliases', 'url', 'country_id', 'country']

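# regex=False: treat the URI prefix as a literal string, not as a pattern
wd_df['org_id'] = wd_df['org_id'].str.replace('http://www.wikidata.org/entity/', '', regex=False)
wd_df['country_id'] = wd_df['country_id'].str.replace('http://www.wikidata.org/entity/', '', regex=False)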
wd_df['aliases'] = wd_df['aliases'].str.split(',')

wd_df.head(10)
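The flattening step inside `sparql_df` can be tried without network access. The sketch below assumes the standard SPARQL JSON results layout (`head.vars` plus `results.bindings`); the sample record and its ID are made up for illustration and are not real Wikidata entries.

```python
# Made-up response in the SPARQL JSON results layout (not real Wikidata data).
results = {
    "head": {"vars": ["org", "orgLabel", "aliases"]},
    "results": {"bindings": [
        {"org": {"type": "uri",
                 "value": "http://www.wikidata.org/entity/Q0000001"},
         "orgLabel": {"type": "literal", "value": "Example Department"},
         "aliases": {"type": "literal", "value": "ExDept,Dept. of Examples"}},
    ]},
}

prefix = "http://www.wikidata.org/entity/"
rows = []
for binding in results["results"]["bindings"]:
    # keep only the 'value' of each bound variable
    row = {col: cell["value"] for col, cell in binding.items()}
    row["org"] = row["org"].replace(prefix, "")   # plain string replace, no regex
    row["aliases"] = row["aliases"].split(",")    # aliases arrive comma-joined
    rows.append(row)

print(rows[0])
```

Note that an alias which itself contains a comma would be split into two entries by this approach.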

## Building a Knowledge Graph


# Introducing the Data Set


In [None]:
import nltk
nltk.download('reuters')

## Data Preparation of NLTK Reuters Corpus (not in book)

This section shows the steps to create the data frame used in some of the examples.

In [None]:
from nltk.corpus import reuters

# List of documents
documents = reuters.fileids()
print(str(len(documents)) + " documents")
print(str(len(reuters.categories())) + " categories:")
print(reuters.categories()[:10] + ['...'])

print(reuters.readme()[:200])

Each article is stored as a separate file. The data files are identified by a file ID of the form "train/1234" or "test/5678". We first create a data frame with the `fileid` column and then load the raw text for each ID into a second column. Finally, as we don't care whether it's train or test, we just extract the number from the file ID and use it as the index of our data frame.

In [None]:
from nltk.corpus import reuters

# create fileid column 
df = pd.DataFrame(reuters.fileids("acq"), columns=['fileid'])
# load raw texts
df['raw'] = df['fileid'].progress_map(lambda f: reuters.raw(f))
# set index to numeric id
df.index = df['fileid'].map(lambda f: int(f.split('/')[1]))
df.index.name = None
df = df.drop(columns=['fileid']).sort_index()

df.sample(3, random_state=12)
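The index conversion can be checked in isolation. The file IDs below are illustrative strings in the `split/number` form described above, not actual corpus entries.

```python
# Illustrative file IDs (hypothetical values)
fileids = ["train/1234", "test/5678", "train/10"]

# the part after the slash becomes the numeric index
ids = sorted(int(f.split('/')[1]) for f in fileids)
print(ids)  # [10, 1234, 5678]
```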

As we see from the example, we will still need some data cleaning before we can expect to get reasonably good results during named entity recognition. First, we separate headlines from the actual news text by splitting at the first newline.

In [None]:
df[['headline', 'raw_text']] = df.progress_apply(lambda row: row['raw'].split('\n', 1), 
                                        axis='columns', result_type='expand')
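A minimal sketch of the split on a made-up raw string shows why the second argument of `split` matters: with a maximum of one split, only the first newline separates headline and body, and later newlines stay in the body.

```python
# Made-up raw article in the layout described above: headline, newline, body
raw = "ACME TO BUY EXAMPLE CORP\n  Acme Inc said it will buy\n  Example Corp.\n"

headline, body = raw.split('\n', 1)
print(repr(headline))
print(repr(body))

# A string without any newline yields a one-element list,
# so unpacking into two names would fail for such input.
print("NO NEWLINE HERE".split('\n', 1))
```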

Now we use the adapted data cleaning blueprint from Chapter 4 to remove some disturbing artifacts, substitute some abbreviations (like "dlr" for dollar), and repair some typos.

In [None]:
def clean(text):
    text = text.replace('&lt;','<') # html escape
    text = re.sub(r'[<>]', '"', text) # quotation marks instead of <>
    text = re.sub(r'[ ]*"[A-Z\.]+"', '', text) # drop stock symbols
    text = re.sub(r'[ ]*\([A-Z\.]+\)', '', text) # drop stock symbols
    text = re.sub(r'\bdlr(s?)\b', r'dollar\1', text, flags=re.I)
    text = re.sub(r'\bmln(s?)\b', r'million\1', text, flags=re.I)
    text = re.sub(r'\bpct\b', r'%', text, flags=re.I)
    # normalize INC to Inc
    text = re.sub(r'\b(Co|Corp|Inc|Plc|Ltd)\b', lambda m: m.expand(r'\1').capitalize(), text, flags=re.I)
    text = re.sub(r'"', r'', text) # quotation marks
    text = re.sub(r'\s+', ' ', text) # multiple whitespace by one
    text = re.sub(r'acquisiton', 'acquisition', text) # typo
    text = re.sub(r'Nippon bLife', 'Nippon Life', text) # typo
    text = re.sub(r'COMSAT.COMSAT', 'COMSAT. COMSAT', text) # missing space at end of sentence
    #text = re.sub(r'Audio/Video', 'Audio-Video', text) # disabled: slash to hyphen

    return text.strip()

So let's have a look at the result of our data cleaning steps:

In [None]:
# that's what the substitutions do
texts = [
"""Trafalgar House Plc &lt;TRAF.L> said it has\n  acquired the entire share capital 
of &lt;Capital Homes Inc> of the\n  U.S. For 20 mln dlrs in cash.""",
"""Equiticorp Holdings Ltd &lt;EQUW.WE> now owns\n  or has received acceptances 
representing 59.93 pct of the\n  issued ordinary share capital of 
Guinness Peat Group Plc\n  &lt;GNSP.L>, Equiticorp said in a statement.""",
"""Computer Terminal Systems Inc said it has completed the sale of 200,000 shares 
of its common stock, and warrants to acquire an additional one mln shares, 
to "Sedio N.V." of Lugano, Switzerland for 50,000 dlrs.""",
"""North American Group Ltd said it has a definitive agreement 
to buy 100  pct of Pioneer Business Group Inc of Atlanta.""" 
]

for text in texts:
    print(clean(text), end="\n\n")

We apply it to the `raw_text` and create a new `text` column:

In [None]:
df['text'] = df['raw_text'].progress_map(clean)
df['headline'] = df['headline'].progress_map(clean)

The newly created column `text` contains the cleaned articles. But we have one disturbing artifact left in the data: a few articles, like the second one in the sample above, consist only of capital letters. In fact, here the raw text is identical to the headlines. We finally drop those because named entity recognition will not yield useful results on such a text.

In [None]:
# we will drop these articles with only capital letters
df[df['raw_text'].map(lambda t: t.isupper())][['headline', 'raw_text']].head(3)

In [None]:
# drop articles with only capital letters
df = df[df['raw_text'].map(lambda t: not t.isupper())]
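`str.isupper()` returns `True` when a string has at least one cased character and all cased characters are uppercase; digits and punctuation are ignored. A quick check on made-up strings:

```python
samples = [
    "ACME CORP <ACME> 4TH QTR NET",             # all caps with digits and symbols
    "Acme Corp said it will buy Example Inc.",  # mixed case
    "1987 / 12.5",                              # no cased characters at all
]
flags = [s.isupper() for s in samples]
for flag, s in zip(flags, samples):
    print(flag, '|', s)
```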

In [None]:
# this is our clean data set
df[['headline', 'text']].sample(3, random_state=12)

In [None]:
pd.options.display.max_colwidth = 200

## Book section continues ...

# Named-Entity Recognition


In [None]:
nlp = spacy.load('en_core_web_sm') 
print(*nlp.pipeline, sep='\n')

In [None]:
text = """Hughes Tool Co Chairman W.A. Kistler said its merger with 
Baker International Corp was still under consideration.
We hope to come soon to a mutual agreement, Kistler said.
The directors of Baker filed a law suit in Texas to force Hughes 
to complete the merger."""
text = re.sub(r'\s+', ' ', text).strip() ###
doc = nlp(text)

print(*[(e.text, e.label_) for e in doc.ents], sep=' ')

In [None]:
from spacy import displacy
displacy.render(doc, style='ent')

## Blueprint: Rule-based Named-Entity Recognition


In [None]:
reset_pipeline(nlp, pipes=[])

In [None]:
from spacy.pipeline import EntityRuler

departments = ['Justice', 'Transportation']
patterns = [{"label": "GOV", 
             "pattern": [{"TEXT": "U.S.", "OP": "?"},
                         {"TEXT": "Department"}, {"TEXT": "of"}, 
                         {"TEXT": {"IN": departments}, "ENT_TYPE": "ORG"}]},
             {"label": "GOV", 
              "pattern": [{"TEXT": "U.S.", "OP": "?"},
                          {"TEXT": {"IN": departments}, "ENT_TYPE": "ORG"},
                          {"TEXT": "Department"}]},
             {"label": "GOV",
              "pattern": [{"TEXT": "Securities"}, {"TEXT": "and"},
                          {"TEXT": "Exchange"}, {"TEXT": "Commission"}]}]

In [None]:
# not in book, but useful if you modify the rules
if nlp.has_pipe('entity_ruler'):
    nlp.remove_pipe('entity_ruler')

In [None]:
entity_ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
nlp.add_pipe(entity_ruler) # pass the configured component so its patterns are used

In [None]:
entity_ruler.name

In [None]:
text = """Justice Department is an alias for the U.S. Department of Justice.
Department of Transportation and the Securities and Exchange Commission
are government organisations, but the Sales Department is not."""
#text = re.sub(r'\s+', ' ', text).strip() ###

doc = nlp(text)
# print(*[([t.text for t in e], e.label_) for e in doc.ents], sep='\n') ###
displacy.render(doc, style='ent', jupyter=True)

## Blueprint: Normalizing Named-Entities


In [None]:
reset_pipeline(nlp, [entity_ruler])

In [None]:
text = "Baker International's shares climbed on the New York Stock Exchange."

doc = nlp(text)
print(*[([t.text for t in e], e.label_) for e in doc.ents], sep='\n')

Bug fix (2021-02-25): added the condition `len(ent) > 0` because an entity may consist of just a determiner (which does not make sense in practice) and would be empty after the leading article is removed.

In [None]:
from spacy.tokens import Span

def norm_entities(doc):
    ents = []
    for ent in doc.ents:
        if ent[0].pos_ == "DET": # leading article
            ent = Span(doc, ent.start+1, ent.end, label=ent.label)
        if len(ent) > 0:
            if ent[-1].pos_ == "PART": # trailing particle like 's
                ent = Span(doc, ent.start, ent.end-1, label=ent.label)
            if len(ent) > 0:
                ents.append(ent)
    doc.ents = tuple(ents)
    return doc
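The trimming logic of `norm_entities` can be followed without spaCy. The sketch below is a simplification that models an entity as a list of `(text, pos)` tuples; `trim_entity` is a hypothetical helper, not part of the book's code.

```python
def trim_entity(tokens):
    """Drop a leading determiner and a trailing particle.
    tokens: list of (text, pos) tuples; may return an empty list."""
    if tokens and tokens[0][1] == "DET":     # leading article
        tokens = tokens[1:]
    if tokens and tokens[-1][1] == "PART":   # trailing particle like 's
        tokens = tokens[:-1]
    return tokens

entities = [
    [("Baker", "PROPN"), ("International", "PROPN"), ("'s", "PART")],
    [("the", "DET"), ("New", "PROPN"), ("York", "PROPN"),
     ("Stock", "PROPN"), ("Exchange", "PROPN")],
    [("the", "DET")],                        # degenerate case: only a determiner
]

# keep only entities that still contain tokens after trimming
trimmed = [t for t in map(trim_entity, entities) if t]
for ent in trimmed:
    print([text for text, _ in ent])
```

The third entity disappears entirely, which is the situation the `len(ent) > 0` checks guard against.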

In [None]:
nlp.add_pipe(norm_entities)

In [None]:
doc = nlp(text)
print(*[([t.text for t in e], e.label_) for e in doc.ents], sep='\n')

In [None]:
# not in book
displacy.render(doc, style='ent', jupyter=True)

## Merging Entity Tokens


In [None]:
from spacy.pipeline import merge_entities
if nlp.has_pipe('merge_entities'): ###
    _ = nlp.remove_pipe('merge_entities') ###
nlp.add_pipe(merge_entities)

doc = nlp(text)
print(*[(t.text, t.ent_type_) for t in doc if t.ent_type_ != ''])

## Testing the NER Pipeline on Sample Data (not in book)

Take random samples from the text and display the result.

In [None]:
reset_pipeline(nlp, [entity_ruler, norm_entities, merge_entities])

In [None]:
i = df['text'].sample(1).index[0]
print("Text Number:", i)

text = df['text'].loc[i][:600]
text = re.sub(r'\s+', ' ', text.strip())

print(text)
doc = nlp(text)
displacy.render(doc, style='ent', jupyter=True)
print(*[([t.text for t in e], e.label_) for e in doc.ents], sep='\n')

In [None]:
pd.options.display.max_rows = 500

# blueprint function to show tokens with entity attributes
display_ner(doc, include_punct=True).query('ent_type != ""')

In [None]:
pd.options.display.max_rows = 60

# Coreference Resolution


## Blueprint: Using spaCy's Token Extensions


In [None]:
# not in book, but useful if you modify the extension
from spacy.tokens import Token

if Token.has_extension('ref_n'):
    _ = Token.remove_extension('ref_n') 
if Token.has_extension('ref_t'):
    _ = Token.remove_extension('ref_t') 
if Token.has_extension('ref_t_'):
    _ = Token.remove_extension('ref_t_')

In [None]:
from spacy.tokens import Token
Token.set_extension('ref_n', default='')
Token.set_extension('ref_t', default='')

In [None]:
def init_coref(doc):
    for e in doc.ents:
        if e.label_ in ['ORG', 'GOV', 'PERSON']:
            e[0]._.ref_n, e[0]._.ref_t = e.text, e.label_
    return doc

## Blueprint: Alias Resolution


In [None]:
from blueprints.knowledge import alias_lookup

for token in ['Transportation Department', 'DOT', 'SEC', 'TWA']:
    print(token, ':', alias_lookup[token])

In [None]:
reset_pipeline(nlp, [entity_ruler, norm_entities, merge_entities, init_coref])

In [None]:
def alias_resolver(doc):
    """Lookup aliases and store result in ref_t, ref_n"""
    for ent in doc.ents:
        token = ent[0].text
        if token in alias_lookup:
            a_name, a_type = alias_lookup[token]
            ent[0]._.ref_n, ent[0]._.ref_t = a_name, a_type
    return propagate_ent_type(doc)

In [None]:
def propagate_ent_type(doc):
    """propagate entity type stored in ref_t"""
    ents = []
    for e in doc.ents:
        if e[0]._.ref_n != '': # if e is a coreference
            e = Span(doc, e.start, e.end, label=e[0]._.ref_t)
        ents.append(e)
    doc.ents = tuple(ents)
    return doc
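As the unpacking in `alias_resolver` shows, `alias_lookup` maps an alias string to a `(name, type)` pair. A minimal sketch with a hypothetical stand-in dictionary follows; the real entries live in the `blueprints` package and may differ.

```python
# Hypothetical stand-in for blueprints.knowledge.alias_lookup
alias_lookup_demo = {
    "Transportation Department": ("U.S. Department of Transportation", "GOV"),
    "DOT": ("U.S. Department of Transportation", "GOV"),
}

def resolve(mention, default_type):
    """Return (canonical name, type); fall back to the mention itself."""
    return alias_lookup_demo.get(mention, (mention, default_type))

print(resolve("DOT", "ORG"))
print(resolve("Hughes Tool", "ORG"))
```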

In [None]:
nlp.add_pipe(alias_resolver)

In [None]:
from blueprints.knowledge import display_ner
text = """The deal of Trans World Airlines is under investigation by the
U.S. Department of Transportation.
The Transportation Department will block the deal of TWA."""
text = re.sub(r'\s+', ' ', text).strip() ###
doc = nlp(text)
display_ner(doc).query("ref_n != ''")[['text', 'ent_type', 'ref_n', 'ref_t']]

## Blueprint: Resolving Name Variations


In [None]:
reset_pipeline(nlp, [entity_ruler, norm_entities, merge_entities, init_coref, alias_resolver])

In [None]:
text = """
Hughes Tool Co Chairman W.A. Kistler said its merger with 
Baker International Corp. was still under consideration.
We hope to come to a mutual agreement, Kistler said.
Baker will force Hughes to complete the merger.
"""
text = re.sub(r'\s+', ' ', text).strip() ### 

doc = nlp(text) 
displacy.render(doc, style='ent', jupyter=True)

In [None]:
def name_match(m1, m2):
    m2 = re.sub(r'[()\.]', '', m2) # ignore parentheses and dots
    m2 = r'\b' + m2 + r'\b' # \b marks word boundary
    m2 = re.sub(r'\s+', r'\\b.*\\b', m2)
    return re.search(m2, m1, flags=re.I) is not None
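To see which mentions `name_match` accepts, the function is repeated here together with a few made-up calls. Each whitespace-separated word of the second mention has to occur in the first one as a whole word and in the same order.

```python
import re

def name_match(m1, m2):
    m2 = re.sub(r'[()\.]', '', m2)        # ignore parentheses and dots
    m2 = r'\b' + m2 + r'\b'               # \b marks word boundary
    m2 = re.sub(r'\s+', r'\\b.*\\b', m2)  # allow arbitrary text between words
    return re.search(m2, m1, flags=re.I) is not None

print(name_match('Hughes Tool Co', 'Hughes'))                # True
print(name_match('Baker International Corp', 'Baker Corp'))  # True
print(name_match('Hughes Tool Co', 'Hugh'))                  # False
print(name_match('Baker International Corp', 'Corp Baker'))  # False
```

Because the second mention is inserted into the pattern without `re.escape`, names containing other regex metacharacters (for example `+` or `[`) would need extra care.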

In [None]:
def name_resolver(doc):
    """create name-based reference to e1 as primary mention of e2"""
    ents = [e for e in doc.ents if e.label_ in ['ORG', 'PERSON']]
    for i, e1 in enumerate(ents):
        for e2 in ents[i+1:]:
            if name_match(e1[0]._.ref_n, e2[0].text): 
                e2[0]._.ref_n = e1[0]._.ref_n
                e2[0]._.ref_t = e1[0]._.ref_t
    return propagate_ent_type(doc)

In [None]:
nlp.add_pipe(name_resolver)

doc = nlp(text)
displacy.render(doc, style='ent', jupyter=True)

In [None]:
display_ner(doc).query("ref_n != ''")[['text', 'ent_type', 'ref_n', 'ref_t']]

## Testing Name Coreference Resolution on Sample Data (not in book)

Take random samples from the text and display the result. You may find examples where the resolution is not working correctly. We have put the emphasis on the simplicity of rules, so there will be cases in which they don't work.

In [None]:
reset_pipeline(nlp, [entity_ruler, norm_entities, merge_entities, init_coref, alias_resolver, name_resolver])

In [None]:
# not in the book:
# pick random examples to test the string matching

i = df['text'].sample(1).index[0]
i = 10
print("Text Number:", i)

text = df['text'].loc[i]#[:300]
# print(text)

doc = nlp(text)
displacy.render(doc, style='ent', jupyter=True)

display_ner(doc).query("ref_n != ''")

## Blueprint: Anaphora Resolution with NeuralCoref


In [None]:
text = """Hughes Tool Co said its merger with Baker
was still under consideration. Hughes had a board meeting today.
W.A. Kistler mentioned that the company hopes for a mutual agreement.
He is reasonably confident."""
text = re.sub(r'\s+', ' ', text).strip() ###

In [None]:
reset_pipeline(nlp, [entity_ruler, norm_entities, merge_entities, 
                     init_coref, alias_resolver, name_resolver])

In [None]:
from neuralcoref import NeuralCoref
neural_coref = NeuralCoref(nlp.vocab, greedyness=0.45)
nlp.add_pipe(neural_coref, name='neural_coref')

In [None]:
doc = nlp(text)
print(*doc._.coref_clusters, sep='\n')

Not in the book: Try the visualization of NeuralCoref!

https://huggingface.co/coref/?text=Hughes%20Tool%20Co%20said%20its%20merger%20with%20Baker%20was%20still%20under%20consideration.%20

In [None]:
def anaphor_coref(doc):
    """anaphora resolution"""
    for token in doc:
        # if token is coref and not already dereferenced
        if token._.in_coref and token._.ref_n == '': 
            ref_span = token._.coref_clusters[0].main # get referred span
            if len(ref_span) <= 3: # consider only short spans
                for ref in ref_span: # find first dereferenced entity
                    if ref._.ref_n != '':
                        token._.ref_n = ref._.ref_n
                        token._.ref_t = ref._.ref_t
                        break
    return doc

In [None]:
if nlp.has_pipe('anaphor_coref'): ###
    nlp.remove_pipe('anaphor_coref') ###
nlp.add_pipe(anaphor_coref)
doc = nlp(text)
display_ner(doc).query("ref_n != ''") \
  [['text', 'ent_type', 'main_coref', 'ref_n', 'ref_t']]

## Name Normalization


In [None]:
def strip_legal_suffix(text):
    return re.sub(r'(\s+and)?(\s+|\b(Co|Corp|Inc|Plc|Ltd)\b\.?)*$', '', text)

print(strip_legal_suffix('Hughes Tool Co'))
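A few more made-up inputs show how the suffix pattern behaves, including names with several suffixes, names without any, and the edge case of a name that consists only of a suffix.

```python
import re

def strip_legal_suffix(text):
    return re.sub(r'(\s+and)?(\s+|\b(Co|Corp|Inc|Plc|Ltd)\b\.?)*$', '', text)

names = ['Hughes Tool Co', 'Baker International Corp.',
         'Example Motor Co Ltd', 'Texaco', 'Inc']
stripped = [strip_legal_suffix(n) for n in names]
for name, result in zip(names, stripped):
    print(repr(name), '->', repr(result))
```

An input that consists only of a legal suffix collapses to the empty string; `norm_names` in the next cell handles this case by clearing `ref_t`.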

In [None]:
def norm_names(doc):
    for t in doc:
        if t._.ref_n != '' and t._.ref_t in ['ORG']:
            t._.ref_n = strip_legal_suffix(t._.ref_n)
            if t._.ref_n == '':
                t._.ref_t = ''
                
    return doc

In [None]:
nlp.add_pipe(norm_names)

## Entity Linking


## Testing Coreference Resolution (not in book)

Not in the book, but a good demonstration of what works well and what doesn't work yet.

In [None]:
# recreate pipeline
reset_pipeline(nlp, [entity_ruler, norm_entities, merge_entities, 
                     init_coref, alias_resolver, name_resolver,
                     neural_coref, anaphor_coref, norm_names])

In [None]:
# pick random examples and test

i = df['text'].sample(1).index[0]
i = 2948 # 1862, 1836,2948,7650,3013,2950,3095
print("Text Number:", i)

text = df['text'].loc[i][:500]
print(text)

doc = nlp(text)
displacy.render(doc, style='ent', jupyter=True)

display_ner(doc).query("ref_n != ''")[['text', 'ent_type', 'main_coref', 'ref_n', 'ref_t']]

# Blueprint: Creating a Cooccurrence Graph


**Largest connected component of the cooccurrence graph generated from the Reuters corpus**  
The visualization was prepared with the help of [Gephi](https://gephi.org/).
<img src="figures/cooc.png" width="80%"/>

## Extracting Cooccurrences from a Document


In [None]:
from itertools import combinations

def extract_coocs(doc, include_types):
    ents = set([(e[0]._.ref_n, e[0]._.ref_t) 
                for e in doc.ents if e[0]._.ref_t in include_types])
    yield from combinations(sorted(ents), 2)
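The pairing step of `extract_coocs` works on plain tuples, so it can be illustrated without a spaCy document. The mentions below are a made-up example of what one article might yield after coreference resolution.

```python
from itertools import combinations

# (name, type) pairs as they might be resolved in one article (made-up example)
mentions = [('Hughes Tool', 'ORG'), ('Baker International', 'ORG'),
            ('SEC', 'GOV'), ('Hughes Tool', 'ORG')]

ents = set(mentions)                         # duplicates within a document collapse
pairs = list(combinations(sorted(ents), 2))  # sorting fixes the order inside each pair
for pair in pairs:
    print(pair)
```

Sorting before pairing means the same two entities always appear in the same order, so `(A, B)` and `(B, A)` are never counted as different edges.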

In [None]:
reset_pipeline(nlp, [entity_ruler, norm_entities, merge_entities, 
                     init_coref, alias_resolver, name_resolver,
                     neural_coref, anaphor_coref, norm_names])

In [None]:
batch_size = 100
batches = math.ceil(len(df)/batch_size) ###

coocs = []
for i in tqdm(range(0, len(df), batch_size), total=batches):
    docs = nlp.pipe(df['text'][i:i+batch_size],
                    disable=['neural_coref', 'anaphor_coref'])
    for j, doc in enumerate(docs):
        coocs.extend([(df.index[i+j], *c) 
                      for c in extract_coocs(doc, ['ORG', 'GOV'])])

In [None]:
print(*coocs[:3], sep='\n')

In [None]:
coocs = [([id], *e1, *e2) for (id, e1, e2) in coocs]
cooc_df = pd.DataFrame.from_records(coocs, 
             columns=('article_id', 'ent1', 'type1', 'ent2', 'type2')) 
cooc_df = cooc_df.groupby(['ent1', 'type1', 'ent2', 'type2'])['article_id'] \
                 .agg(['count', 'sum']) \
                 .rename(columns={'count': 'freq', 'sum': 'articles'}) \
                 .reset_index().sort_values('freq', ascending=False)
cooc_df['articles'] = cooc_df['articles'].map(
                        lambda lst: ','.join([str(a) for a in lst[:5]]))
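The article ID is wrapped in a list because summing lists concatenates them: the code above relies on the `sum` aggregate to collect the IDs per entity pair, while `count` gives the frequency. A plain-Python sketch of the same aggregation on made-up records:

```python
from collections import defaultdict

# (article_id, ent1, ent2) records, made up for illustration
records = [(11, 'A Corp', 'B Inc'), (12, 'A Corp', 'B Inc'),
           (12, 'A Corp', 'SEC'), (15, 'A Corp', 'B Inc')]

articles = defaultdict(list)
for article_id, ent1, ent2 in records:
    articles[(ent1, ent2)] += [article_id]   # list concatenation, like 'sum'

# frequency = number of articles; keep at most five IDs as a string
table = sorted(((pair, len(ids), ','.join(map(str, ids[:5])))
                for pair, ids in articles.items()),
               key=lambda row: row[1], reverse=True)
for row in table:
    print(row)
```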

In [None]:
cooc_df.head(3)

## Visualizing the Graph with Gephi


In [None]:
import networkx as nx

graph = nx.from_pandas_edgelist(
           cooc_df[['ent1', 'ent2', 'articles', 'freq']] \
           .query('freq > 3').rename(columns={'freq': 'weight'}),
           source='ent1', target='ent2', edge_attr=True)

nx.readwrite.write_gexf(graph, 'cooc.gexf', encoding='utf-8', 
                        prettyprint=True, version='1.2draft')

## Visualizing the Graph with NetworkX (not in book)

We can also use NetworkX for drawing, although the result is not as nice. The code below shows more nodes than the figure in the book, where we manually removed several nodes for the sake of clarity.

In [None]:
# identify the largest connected component (connected subgraph)
# and plot only that one
giant_component = sorted(nx.connected_components(graph), key=len, reverse=True)
graph = graph.subgraph(giant_component[0])

pos = nx.kamada_kawai_layout(graph, weight='weight')
# pos = nx.fruchterman_reingold_layout(graph, weight='weight')
# pos = nx.circular_layout(graph)

_ = plt.figure(figsize=(20, 20))
nx.draw(graph, pos, 
        node_size=1000, 
        node_color='skyblue',
        alpha=0.8,
        with_labels = True)
plt.title('Graph Visualization', size=15)

for (node1,node2,data) in graph.edges(data=True):
    width = data['weight'] 
    _ = nx.draw_networkx_edges(graph,pos,
                               edgelist=[(node1, node2)],
                               width=width,
                               edge_color='#505050',
                               alpha=0.5)

plt.show()

## Blueprint: Identifying Acronyms (not in book)

A good list of acronym suggestions is easy to generate by searching for frequent cooccurrences of acronyms with other entities.

To find possible acronyms in the cooccurrence data frame, we look for all tuples that have an acronym (all capital letters) either as source or as target. As additional conditions, we require that the first letter in both is the same and the combination exists more than once.
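The three conditions can be sketched on plain tuples before applying them to the data frame. The pairs and frequencies below are made up for illustration, and `is_candidate` is a hypothetical helper name.

```python
# (ent1, ent2, freq) tuples, made up for illustration
pairs = [('TWA', 'Trans World Airlines', 5),
         ('Trans World Airlines', 'USAir', 4),
         ('SEC', 'Securities and Exchange Commission', 1),
         ('DOT', 'Texas Air', 3)]

def is_candidate(ent1, ent2, freq):
    return ((ent1.isupper() or ent2.isupper())   # one side is all caps
            and ent1[:1] == ent2[:1]             # same first letter
            and freq > 1)                        # seen more than once

candidates = [(e1, e2) for e1, e2, f in pairs if is_candidate(e1, e2, f)]
print(candidates)  # [('TWA', 'Trans World Airlines')]
```

The heuristic is deliberately loose, which is why the resulting list still needs manual curation.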

In [None]:
reset_pipeline(nlp, [entity_ruler, norm_entities, merge_entities, 
                     init_coref, name_resolver, norm_names]) # no alias resolver

In [None]:
batch_size = 100
batches = math.ceil(len(df)/batch_size) ###

coocs = []
for i in tqdm(range(0, len(df), batch_size), total=batches):
    docs = nlp.pipe(df['text'][i:i+batch_size])
    for j, doc in enumerate(docs):
        coocs.extend([(df.index[i+j], *c) for c in extract_coocs(doc, ['ORG', 'GOV'])])

coocs = [([id], *e1, *e2) for (id, e1, e2) in coocs]

In [None]:
cooc_df = pd.DataFrame.from_records(coocs, 
             columns=('article_id', 'ent1', 'type1', 'ent2', 'type2')) 
cooc_df = cooc_df.groupby(['ent1', 'ent2'])['article_id'] \
                 .agg(['count']).rename(columns={'count': 'freq'}) \
                 .reset_index().sort_values('freq', ascending=False)

acro_pattern = (cooc_df['ent1'].str.isupper() | cooc_df['ent2'].str.isupper()) & \
               (cooc_df['ent1'].str[:1] == cooc_df['ent2'].str[:1]) & \
               (cooc_df['freq'] > 1)

print(len(cooc_df[acro_pattern]))
cooc_df[acro_pattern][:10]

For our corpus, this yields about 40 potential acronyms.

We save them to a file:

In [None]:
# export to csv
cooc_df[acro_pattern][['ent1', 'ent2']] \
  .sort_values(['ent1', 'ent2']) \
  .to_csv('possible_acronyms.txt', index=False)

This file has to be curated manually. After cleaning, we load the remaining acronyms and convert them to a dictionary:

In [None]:
# curate manually the csv
acro_df = pd.read_csv('possible_acronyms.txt')
acro_df.set_index('ent1')['ent2'].to_dict()

We took this list and curated it to create a dictionary that maps acronyms to their long names. It is provided in the blueprints package for this chapter and is part of `alias_lookup`. Here are some example entries:

In [None]:
from blueprints.knowledge import _acronyms

for acro in ['TWA', 'UCPB', 'SEC', 'DOT']:
    print(acro, ' --> ', alias_lookup[acro])

# Relation Extraction


## Blueprint: Relation Extraction by Phrase Matching


In [None]:
# use large model, otherwise the examples look different!
# to make it work on Colab, we need to import the model directly
# usually you would use nlp = spacy.load('en_core_web_lg') 
import en_core_web_lg
nlp = en_core_web_lg.load()

# need to re-create the entity ruler after reloading nlp
# because new entity type 'GOV' needs to be added to nlp.vocab
entity_ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)

In [None]:
# recreate pipeline
reset_pipeline(nlp, [entity_ruler, norm_entities, merge_entities, 
                     init_coref, alias_resolver, name_resolver, norm_names,
                     neural_coref, anaphor_coref])

In [None]:
text = """Fujitsu plans to acquire 80% of Fairchild Corp, an industrial unit
of Schlumberger."""
text = re.sub(r'\s+', ' ', text).strip() ###

In [None]:
doc = nlp(text)

displacy.render(doc, style='ent', jupyter=True)

In [None]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

acq_synonyms = ['acquire', 'buy', 'purchase']
pattern = [{'_': {'ref_t': 'ORG'}}, # subject
           {'_': {'ref_t': {'NOT_IN': ['ORG']}}, 'OP': '*'},
           {'POS': 'VERB', 'LEMMA': {'IN': acq_synonyms}},
           {'_': {'ref_t': {'NOT_IN': ['ORG']}}, 'OP': '*'},
           {'_': {'ref_t': 'ORG'}}] # object
matcher.add('acquires', None, pattern)

In [None]:
subs_synonyms = ['subsidiary', 'unit']
pattern = [{'_': {'ref_t': 'ORG'}}, # subject
           {'_': {'ref_t': {'NOT_IN': ['ORG']}}, 
            'POS': {'NOT_IN': ['VERB']}, 'OP': '*'},
           {'LOWER': {'IN': subs_synonyms}}, {'TEXT': 'of'},
           {'_': {'ref_t': {'NOT_IN': ['ORG']}}, 
            'POS': {'NOT_IN': ['VERB']}, 'OP': '*'},
           {'_': {'ref_t': 'ORG'}}] # object
matcher.add('subsidiary-of', None, pattern)

In [None]:
def extract_rel_match(doc, matcher):
    for sent in doc.sents:
        for match_id, start, end in matcher(sent):
            span = sent[start:end]  # matched span
            pred = nlp.vocab.strings[match_id] # rule name
            subj, obj = span[0], span[-1]
            if pred.startswith('rev-'): # reversed relation
                subj, obj = obj, subj
                pred = pred[4:]
            yield ((subj._.ref_n, subj._.ref_t), pred, 
                   (obj._.ref_n, obj._.ref_t))

In [None]:
pattern = [{'_': {'ref_t': 'ORG'}}, # subject
           {'LOWER': {'IN': subs_synonyms}}, # predicate
           {'_': {'ref_t': 'ORG'}}] # object
matcher.add('rev-subsidiary-of', None, pattern)
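The `rev-` naming convention lets a rule declare that its first and last tokens appear in reverse order. The sketch below isolates that step from `extract_rel_match`; `normalize_relation` is a hypothetical helper name, and plain strings stand in for tokens.

```python
def normalize_relation(first, rule_name, last):
    """Return (subject, predicate, object) for a matched span.
    Rules prefixed with 'rev-' list the object first."""
    subj, pred, obj = first, rule_name, last
    if pred.startswith('rev-'):       # reversed relation
        subj, obj = obj, subj
        pred = pred[4:]               # strip the 'rev-' prefix
    return subj, pred, obj

# "Fairchild Corp, an industrial unit of Schlumberger"
forward = normalize_relation('Fairchild', 'subsidiary-of', 'Schlumberger')
# "The Schlumberger unit Fairchild Corp"
reverse = normalize_relation('Schlumberger', 'rev-subsidiary-of', 'Fairchild')
print(forward)
print(reverse)
```

Both phrasings end up as the same triple, so the two rules feed one relation type.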

In [None]:
text = """Fujitsu plans to acquire 80% of Fairchild Corp, an industrial unit 
of Schlumberger. The Schlumberger unit Fairchild Corp received an offer."""
text = re.sub(r'\s+', ' ', text) ###
doc = nlp(text)
print(*extract_rel_match(doc, matcher), sep='\n')

In [None]:
text = "Fairchild Corp was acquired by Fujitsu."
print(*extract_rel_match(nlp(text), matcher), sep='\n')

In [None]:
text = "Fujitsu, a competitor of NEC, acquired Fairchild Corp."
print(*extract_rel_match(nlp(text), matcher), sep='\n')

In [None]:
if matcher.has_key("acquires"): 
    matcher.remove("acquires")

## Blueprint: Relation Extraction using Dependency Trees


In [None]:
# recreate pipeline
reset_pipeline(nlp, [norm_entities, merge_entities, 
                     init_coref, alias_resolver, name_resolver, norm_names,
                     neural_coref, anaphor_coref])

In [None]:
text = "Fujitsu, a competitor of NEC, acquired Fairchild Corp."
doc = nlp(text)
displacy.render(doc, style='dep', jupyter=True, 
                options={'compact': False, 'distance': 100})

In [None]:
text = "Fairchild Corp was acquired by Fujitsu."
doc = nlp(text)
displacy.render(doc, style='dep', jupyter=True, options={'compact': False, 'distance': 100})

In [None]:
# Here is the longer part of the code that was skipped in the book.

# We search for the shortest path from the subject through
# our predicate (verb) to the object.
# Subject and object are organizations in our examples.

# Here are the four helper functions omitted in the book:
# - bfs: breadth first searching the closest subject/object 
# - is_passive: checks if noun or verb is in passive form
# - find_subj: searches left part of tree for subject
# - find_obj: searches right part of tree for object

from collections import deque

def bfs(root, ent_type, deps, first_dep_only=False):
    """Return first child of root (included) that matches
    ent_type and dependency list by breadth first search.
    Search stops after first dependency match if first_dep_only
    (used for subject search - do not "jump" over subjects)"""
    to_visit = deque([root]) # queue for bfs

    while len(to_visit) > 0:
        child = to_visit.popleft()
        # print("child", child, child.dep_)
        if child.dep_ in deps:
            if child._.ref_t == ent_type:
                return child
            elif first_dep_only: # first match (subjects)
                return None
        elif child.dep_ == 'compound' and \
             child.head.dep_ in deps and \
             child._.ref_t == ent_type: # check if contained in compound
            return child
        to_visit.extend(list(child.children))
    return None

def is_passive(token):
    if token.dep_.endswith('pass'): # noun
        return True
    for left in token.lefts: # verb
        if left.dep_ == 'auxpass':
            return True
    return False

def find_subj(pred, ent_type, passive):
    """Find closest subject in predicates left subtree or
    predicates parent's left subtree (recursive).
    Has a filter on organizations."""
    for left in pred.lefts:
        if passive: # if pred is passive, search for passive subject
            subj = bfs(left, ent_type, ['nsubjpass', 'nsubj:pass'], True)
        else:
            subj = bfs(left, ent_type, ['nsubj'], True)
        if subj is not None: # found it!
            return subj
    if pred.head != pred and not is_passive(pred): 
        return find_subj(pred.head, ent_type, passive) # climb up left subtree
    else:
        return None

def find_obj(pred, ent_type, excl_prepos):
    """Find closest object in the predicate's right subtree.
    Skip prepositional objects if the preposition is in the exclude list.
    Only returns tokens whose resolved type equals ent_type."""
    for right in pred.rights:
        obj = bfs(right, ent_type, ['dobj', 'pobj', 'iobj', 'obj', 'obl'])
        if obj is not None:
            if obj.dep_ == 'pobj' and obj.head.lemma_.lower() in excl_prepos: # check preposition
                continue
            return obj
    return None

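The effect of `first_dep_only` is easier to see on a toy tree. The following is a minimal sketch, not the notebook's code: `Node` and `bfs_toy` are hypothetical stand-ins for spaCy tokens and for `bfs`, and the compound check is left out for brevity.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Hypothetical stand-in for a spaCy token (not part of the notebook)."""
    text: str
    dep: str                      # dependency label, like token.dep_
    ent_type: str = ""            # resolved entity type, like token._.ref_t
    children: List["Node"] = field(default_factory=list)

def bfs_toy(root, ent_type, deps, first_dep_only=False):
    """Same control flow as bfs() above, minus the compound check."""
    to_visit = deque([root])
    while to_visit:
        node = to_visit.popleft()
        if node.dep in deps:
            if node.ent_type == ent_type:
                return node
            if first_dep_only:    # do not jump over a non-matching subject
                return None
        to_visit.extend(node.children)
    return None

# "The company, a rival of NEC, ...": the subject itself is not an ORG
nec = Node("NEC", "pobj", "ORG")
subject = Node("company", "nsubj", "", [Node("of", "prep", "", [nec])])

strict = bfs_toy(subject, "ORG", ["nsubj"], first_dep_only=True)
relaxed = bfs_toy(subject, "ORG", ["nsubj", "pobj"])
print(strict)          # None: the search stops at the first nsubj
print(relaxed.text)    # NEC: found two levels below the root
```

With `first_dep_only=True` the search gives up at the first subject that is not an organization, which is why `find_subj` does not pick up an organization buried deeper in the subject's subtree.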
In [None]:
def extract_rel_dep(doc, pred_name, pred_synonyms, excl_prepos=()):
    for token in doc:
        if token.pos_ == 'VERB' and token.lemma_ in pred_synonyms:
            pred = token
            passive = is_passive(pred)
            subj = find_subj(pred, 'ORG', passive)
            if subj is not None:
                obj = find_obj(pred, 'ORG', excl_prepos)
                if obj is not None:
                    if passive: # switch roles
                        obj, subj = subj, obj
                    yield ((subj._.ref_n, subj._.ref_t), pred_name, 
                           (obj._.ref_n, obj._.ref_t))

In [None]:
text = """Fujitsu said that Schlumberger Ltd has arranged 
to sell its stake in Fairchild Inc."""
doc = nlp(text)
print(*extract_rel_dep(doc, 'sells', ['sell']), sep='\n')

In [None]:
text = "Schlumberger Ltd has arranged to sell to Fujitsu its stake in Fairchild Inc."
doc = nlp(text)
print(*extract_rel_dep(doc, 'sells', ['sell']), sep='\n')
displacy.render(doc, style='dep', jupyter=True, options={'compact': False, 'distance': 80})

In [None]:
print("A:", *extract_rel_dep(doc, 'sells', ['sell']))
print("B:", *extract_rel_dep(doc, 'sells', ['sell'], ['to', 'from']))

In [None]:
texts = [ 
     "Fairchild Corp was bought by Fujitsu.", # 1
     "Fujitsu, a competitor of NEC Co, acquired Fairchild Inc.", # 2
     "Fujitsu is expanding. " + # trailing space separates the two sentences
     "The company made an offer to acquire 80% of Fairchild Inc.", # 3
     "Fujitsu plans to acquire 80% of Fairchild Corp.", # 4
     "Fujitsu plans not to acquire Fairchild Corp.", # 5
     "The competition forced Fujitsu to acquire Fairchild Corp." # 6
]

acq_synonyms = ['acquire', 'buy', 'purchase']
for i, text in enumerate(texts):
    doc = nlp(text)
    rels = extract_rel_dep(doc, 'acquires', acq_synonyms, ['to', 'from'])
    print(f'{i+1}:', *rels)

# Creating the Knowledge Graph


**On Colab**: Choose "Runtime"&rarr;"Change Runtime Type"&rarr;"GPU" to benefit from the GPUs.

In [None]:
if spacy.prefer_gpu():
    print("Working on GPU.")
else:
    print("No GPU found, working on CPU.")

nlp = en_core_web_lg.load()

In [None]:
# need to re-create the entity ruler after reloading nlp
# because new entity type 'GOV' needs to be added to nlp.vocab
entity_ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)

In [None]:
pipes = [entity_ruler, norm_entities, merge_entities,
         init_coref, alias_resolver, name_resolver, 
         neural_coref, anaphor_coref, norm_names]
for pipe in pipes:
    nlp.add_pipe(pipe)

In [None]:
# recreate matcher - same definition as above for these rules
matcher = Matcher(nlp.vocab)

subs_synonyms = ['subsidiary', 'unit']
pattern = [{'_': {'ref_t': 'ORG'}}, # subject
           {'_': {'ref_t': {'NOT_IN': ['ORG']}}, 'POS': {'NOT_IN': ['VERB']}, 'OP': '*'},
           {'LOWER': {'IN': subs_synonyms}}, # predicate
           {'TEXT': 'of'},
           {'_': {'ref_t': {'NOT_IN': ['ORG']}}, 'POS': {'NOT_IN': ['VERB']}, 'OP': '*'},
           {'_': {'ref_t': 'ORG'}}] # object
matcher.add('subsidiary-of', None, pattern)

pattern = [{'_': {'ref_t': 'ORG'}}, # subject
           {'POS': 'PART', 'OP': '?'},
           {'LOWER': {'IN': subs_synonyms}}, # predicate
           {'_': {'ref_t': 'ORG'}}] # object
matcher.add('rev-subsidiary-of', None, pattern)

In [None]:
ceo_synonyms = ['chairman', 'president', 'director', 'ceo', 'executive']
pattern = [{'ENT_TYPE': 'PERSON'},
           {'ENT_TYPE': {'NOT_IN': ['ORG', 'PERSON']}, 'OP': '*'}, 
           {'LOWER': {'IN': ceo_synonyms}}, {'TEXT': 'of'},
           {'ENT_TYPE': {'NOT_IN': ['ORG', 'PERSON']}, 'OP': '*'}, 
           {'ENT_TYPE': 'ORG'}] 
matcher.add('executive-of', None, pattern)

pattern = [{'ENT_TYPE': 'ORG'}, 
           {'LOWER': {'IN': ceo_synonyms}},
           {'ENT_TYPE': 'PERSON'}] 
matcher.add('rev-executive-of', None, pattern)

In [None]:
def extract_rels(doc):
    yield from extract_rel_match(doc, matcher)
    yield from extract_rel_dep(doc, 'acquires', acq_synonyms, ['to', 'from'])
    yield from extract_rel_dep(doc, 'sells', ['sell'], ['to', 'from'])

## Testing Relationship Extraction (not in book)

In [None]:
text = """Allied-Signal Inc and Schlumberger Ltd jointly announced 
that Schlumberger had acquired Allied-Signal's unit Neptune International. 
"""
#text = df.text.loc[19975]

text = re.sub(r'\s+', ' ', text).strip()
print(*textwrap.wrap(text, 100), sep='\n')
print()
doc = nlp(text, disable=['entity_ruler']) # disable expects a list of pipe names
#displacy.render(doc, style='ent')
print(*extract_rels(doc), sep='\n')
displacy.render(doc, style='dep', jupyter=True, options={'compact': False, 'distance': 100})

## Extraction of Entities and Relations and Creation of Gephi-File (not in book)

Batch processing for entity extraction with subsequent relation extraction. This takes about 5 minutes; roughly 80% of the runtime is spent in NeuralCoref.

In [None]:
from math import ceil

batch_size = 20
batches = ceil(len(df) / batch_size) ###

rels = []
for i in tqdm(range(0, len(df), batch_size), total=batches):
    docs = nlp.pipe(df['text'][i:i+batch_size])
    for j, doc in enumerate(docs):
        rels.extend([(df.index[i+j], *r) for r in extract_rels(doc)])

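The index arithmetic in the loop above (`df.index[i+j]`) maps each document in a batch back to its row. A minimal sketch of that bookkeeping with plain lists; `iter_batches` and the toy `texts` are hypothetical names used only for illustration:

```python
from math import ceil

def iter_batches(items, batch_size):
    """Yield (offset, batch) pairs; offset is the index of the batch's first item."""
    for offset in range(0, len(items), batch_size):
        yield offset, items[offset:offset + batch_size]

texts = [f"article {n}" for n in range(7)]   # toy stand-in for df['text']
batch_size = 3
n_batches = ceil(len(texts) / batch_size)    # last batch may be shorter

positions = []
for offset, batch in iter_batches(texts, batch_size):
    for j, text in enumerate(batch):
        positions.append(offset + j)         # same arithmetic as df.index[i+j]

print(n_batches)   # 3
print(positions)   # [0, 1, 2, 3, 4, 5, 6]
```

Every position is visited exactly once, including those in the shorter final batch.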
Creation of the relation data frame including final curation:

In [None]:
# unpack subject and object 
rels = [(a_id, *subj, pred, *obj) for (a_id, subj, pred, obj) in rels]

# create data frame
rel_df = pd.DataFrame.from_records(rels, columns=('article_id', 'subj', 'subj_type', 'pred', 'obj', 'obj_type'))

# false positives: subject cannot be object
rel_df = rel_df.query('subj != obj')

# filter entities that were not correctly detected
# tokenizer produces "-owned XYZ company"
rel_df = rel_df[~rel_df['subj'].str.startswith('-own')]
rel_df = rel_df[~rel_df['obj'].str.startswith('-own')]

# drop duplicate relations (within an article)
rel_df = rel_df.drop_duplicates()

# aggregate to produce one record per relation
rel_df['article_id'] = rel_df['article_id'].map(lambda a: [a])
rel_df = rel_df.groupby(['subj', 'subj_type', 'pred', 'obj', 'obj_type'])['article_id'] \
                 .agg(['count', 'sum']) \
                 .rename(columns={'count': 'freq', 'sum': 'articles'}) \
                 .reset_index().sort_values('freq', ascending=False)

rel_df['articles'] = rel_df['articles'].map(lambda lst: ','.join(list(set([str(a) for a in lst]))))
rel_df.head(10)

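The curation steps above boil down to: drop self-relations, drop duplicates within an article, and collapse to one record per relation with a frequency and the list of source articles. A plain-Python sketch of the same logic, without Pandas and with made-up toy records, might look like this:

```python
from collections import defaultdict

# toy records: (article_id, subj, pred, obj); values are made up for illustration
records = [
    (1, "A Corp", "acquires", "B Inc"),
    (1, "A Corp", "acquires", "B Inc"),   # duplicate within one article
    (2, "A Corp", "acquires", "B Inc"),
    (2, "A Corp", "acquires", "A Corp"),  # subject equals object
    (3, "C Ltd", "sells", "B Inc"),
]

articles_by_rel = defaultdict(set)
for article_id, subj, pred, obj in records:
    if subj == obj:            # false positive: subject cannot be object
        continue
    articles_by_rel[(subj, pred, obj)].add(article_id)  # set drops duplicates

# one row per relation, most frequent first
rows = sorted(
    ((rel, len(ids), ",".join(str(a) for a in sorted(ids)))
     for rel, ids in articles_by_rel.items()),
    key=lambda row: row[1], reverse=True)

for row in rows:
    print(row)
```

Here the frequency is the number of distinct articles that mention a relation, which is what `freq` holds in `rel_df` after `drop_duplicates()`.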
In [None]:
# some statistics
rel_df['pred'].value_counts()

In [None]:
# try searching for a specific entity
search = "Trans World"
rel_df[(rel_df.subj.str.lower().str.contains(search.lower()) | 
        rel_df.obj.str.lower().str.contains(search.lower()))]

In [None]:
# in fact, TWA acquires and sells parts of USAir according to the messages
# look at a specific article
text = df['text'][9487]
print(*textwrap.wrap(text, 80), sep='\n')

Be careful when creating the NetworkX graph: we need a `MultiDiGraph` here, that is, a directed graph that allows multiple edges between the same two nodes.

In [None]:
import networkx as nx
from networkx import MultiDiGraph

graph = MultiDiGraph()
for i, row in rel_df.iterrows():
    graph.add_node(row['subj'], Type=row['subj_type'])
    graph.add_node(row['obj'], Type=row['obj_type'])
    _ = graph.add_edge(row['subj'], row['obj'], 
                   Articles=row['articles'], Rel=row['pred'])
   
nx.readwrite.write_gexf(graph, 'knowledge_graph.gexf', encoding='utf-8', 
                         prettyprint=True, version='1.2draft')

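Why a plain `DiGraph` is not enough can be checked on two relations between the same pair of nodes. This is a small sketch with toy edges (the TWA/USAir pair mirrors the example above; the edge list itself is made up):

```python
import networkx as nx

edges = [("TWA", "USAir", "acquires"), ("TWA", "USAir", "sells")]

simple = nx.DiGraph()
multi = nx.MultiDiGraph()
for subj, obj, pred in edges:
    simple.add_edge(subj, obj, Rel=pred)   # second call overwrites the first
    multi.add_edge(subj, obj, Rel=pred)    # second call adds a parallel edge

print(simple.number_of_edges())            # 1
print(simple["TWA"]["USAir"]["Rel"])       # sells
print(multi.number_of_edges())             # 2
print(sorted(d["Rel"] for _, _, d in multi.edges(data=True)))
```

In the `DiGraph` the "acquires" relation is silently lost, while the `MultiDiGraph` keeps both edges.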
Choose the merge strategy "last" when you load the data into Gephi, as relations with the highest counts come last in the GEXF file.

## Book section continues ...

### Don't Blindly Trust the Results


# Closing Remarks


# Further Reading
