#### Designing a Knowledge Graph Schema
Before jumping into data ingestion, we must consider the structure of our knowledge graph. For our use case, we're interested in connecting related documents and concepts.

In terms of nodes, we have both abstracts and terms. Our abstracts have only an ID and text. Our
terms are similar, although their text is shorter. As a result, our node properties are simple,
with only two properties needed per node.

The existence of two types of entities in our hypothetical graph, abstracts and terms, means that our graph is heterogeneous (containing multiple types of objects or multiple types of links).
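
As a rough illustration of this schema, the sketch below models the two node types and the FOUND_IN relationship (with the frequency attribute used later in this section) as plain Python dataclasses. The class and field names are hypothetical and chosen only for this example; they are not part of igraph or of the chapter's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractNode:
    # Hypothetical container for an abstract: an integer ID and its full text
    node_id: int
    text: str


@dataclass(frozen=True)
class TermNode:
    # Hypothetical container for a term: an integer ID and a short text
    node_id: int
    text: str


@dataclass(frozen=True)
class FoundIn:
    # Edge from a term to an abstract, weighted by how often the term occurs
    term_id: int
    abstract_id: int
    frequency: int


abstract = AbstractNode(node_id=0, text="IgE sensitization in refractory asthma")
term = TermNode(node_id=1, text="asthma")
edge = FoundIn(term_id=term.node_id, abstract_id=abstract.node_id, frequency=2)

# Two distinct node types means the graph is heterogeneous
node_types = {type(abstract).__name__, type(term).__name__}
print(len(node_types))  # 2
```

Both node types carry the same two properties, which is why the schema stays simple even though the graph is heterogeneous.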

##### Linking text to terms
Now that we know the graph schema we are aiming for, we need to isolate our nodes and edges, ready for graph construction. We already have our abstracts, which will make up one node type, but we have yet to identify terms.

Our aim is to extract biomedical terms from abstracts to use as nodes in our graph. We therefore need a method to extract biomedical terms specifically.

Let's get started by loading our cleaned abstracts from the CSV file, ready for processing:

In [1]:
import csv

# newline='' lets the csv module handle line endings inside quoted fields
with open('./data/20k_abstracts_clean.csv', 'r', newline='') as c:
    reader = csv.reader(c)
    data = [line for line in reader]

Next, we need to load a model of biomedical language. For this example, we will use en_core_sci_sm, a small spaCy pipeline from the scispacy project that is trained on biomedical text. We will use it to extract the entities that become the term nodes, and ultimately the edges, of our knowledge graph.

In [2]:
#import scispacy
import spacy
nlp = spacy.load("en_core_sci_sm")



Let’s take a look at which biomedical entities the scispacy model finds in
our first abstract. We can access the first abstract with a list index, selecting the second element
of the first row of our list of lists. Then, by calling our nlp object on the text, we can analyze it using the model we loaded previously.

In [3]:
text = data[0][1]
doc = nlp(text)
print(list(doc.ents))

[IgE, sensitization, Aspergillus fumigatus, positive, sputum fungal, patients, refractory asthma, patients, antifungal treatment, voriconazole, asthma-related outcomes, patients, asthma, IgE, sensitized, A fumigatus, Asthmatic, patients, IgE, sensitized, fumigatus, history, severe, exacerbations, months, treated, months, voriconazole, observation, months, double-blind, placebo-controlled, randomized design, Primary outcomes, improvement, quality of life, treatment, period, reduction, severe, exacerbations, months, study, Sixty-five patients, randomized, patients, started, treatment, voriconazole, placebo, intention-to-treat analysis, patients, months, medication, voriconazole, placebo groups, severe, exacerbations, patient, CI, quality of life, Asthma, Quality of Life Questionnaire score, groups, CI, secondary outcome, effect, months, treatment, voriconazole, patients, moderate-to-severe asthma, IgE, sensitized, rate, severe, exacerbations, quality of life, markers, asthma control]


We can see that technical-looking entities such as IgE and asthma are being extracted, which
are the type of biomedical language terms we need to connect our abstracts to. However, we can
also see entities such as positive and severe, which, while used in technical literature,
are not specific enough to inform us about the contents of the abstract. Also, we see that some
entities have capital letters. This would potentially prevent linking the same term across abstracts
where the terms match but casing differs.

Let's first extract all of the biomedical entities in each abstract, using a similar method to the previous approach, but now in a list comprehension. The second element of each row contains an abstract's text, so that is what we pass to nlp().

In [4]:
abstract_entities = [[row[0], nlp(row[1]).ents] for row in data]

We can now deal with the casing issue we encountered when looking at the first abstract. We write another list comprehension that converts each entity into a string and then uses the lower() method to convert each term to lowercase.

In [5]:
abstract_entities = [[row[0], [str(ent).lower() for ent in row[1]]] for row in abstract_entities]
print(abstract_entities[:5])

[['0', ['ige', 'sensitization', 'aspergillus fumigatus', 'positive', 'sputum fungal', 'patients', 'refractory asthma', 'patients', 'antifungal treatment', 'voriconazole', 'asthma-related outcomes', 'patients', 'asthma', 'ige', 'sensitized', 'a fumigatus', 'asthmatic', 'patients', 'ige', 'sensitized', 'fumigatus', 'history', 'severe', 'exacerbations', 'months', 'treated', 'months', 'voriconazole', 'observation', 'months', 'double-blind', 'placebo-controlled', 'randomized design', 'primary outcomes', 'improvement', 'quality of life', 'treatment', 'period', 'reduction', 'severe', 'exacerbations', 'months', 'study', 'sixty-five patients', 'randomized', 'patients', 'started', 'treatment', 'voriconazole', 'placebo', 'intention-to-treat analysis', 'patients', 'months', 'medication', 'voriconazole', 'placebo groups', 'severe', 'exacerbations', 'patient', 'ci', 'quality of life', 'asthma', 'quality of life questionnaire score', 'groups', 'ci', 'secondary outcome', 'effect', 'months', 'treatment

Now, we see that the extracted terms are all lowercase, which means when we use them to create nodes later, we won't have more than one node for terms such as Asthma and asthma.
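
To make the effect of lowercasing concrete, here is a minimal sketch on a handful of made-up terms (the list below is illustrative and not taken from the dataset). It counts distinct terms before and after normalization.

```python
# Illustrative terms with inconsistent casing (not real pipeline output)
raw_terms = ["Asthma", "asthma", "IgE", "ige", "Voriconazole"]

# Without normalization, each casing variant would become its own node
distinct_raw = set(raw_terms)

# After lowercasing, variants collapse into a single term
distinct_lower = {term.lower() for term in raw_terms}

print(len(distinct_raw), len(distinct_lower))  # 5 3
```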

Next, we noticed that scispacy extracts both relevant terms, such as asthma, and less relevant terms, such as
positive, which are likely to be highly common among otherwise unrelated abstracts. In many NLP applications, the number of extracted entities that make it through
to further processing is limited. This is commonly approached by using the frequency of entity
occurrence in documents. The extracted term positive is likely to appear many times, and
for a knowledge graph, this means connecting many unrelated documents, with no real benefit.

In [6]:
# To look into this issue, we can first examine the frequency of our extracted entities across all of the abstracts
all_entities = [row[1] for row in abstract_entities]
#all_entities[1]

all_entities contains a list of lists, where each list contains many terms. We want to look at the frequency of terms across all abstracts, so we need to join these lists together. We will use Python's itertools module, and use itertools.chain.from_iterable() to flatten our list of lists into a single sequence of terms:

In [7]:
import itertools
# chain.from_iterable() returns a one-shot iterator, so store the result as a list
entities = list(itertools.chain.from_iterable(all_entities))
#entities[1]
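
The flattening step can be sketched on toy data as follows. The nested list is made up for illustration. The sketch also shows that the object returned by chain.from_iterable() is a lazy iterator that can only be consumed once, which matters if the flattened terms are needed more than once.

```python
import itertools

# Toy stand-in for all_entities: one list of terms per abstract
nested = [["ige", "asthma"], ["asthma", "placebo"], []]

flat_iter = itertools.chain.from_iterable(nested)

# Consuming the iterator yields every term in order
flat = list(flat_iter)

# A second pass finds nothing, because the iterator is already exhausted
leftover = list(flat_iter)

print(flat)      # ['ige', 'asthma', 'asthma', 'placebo']
print(leftover)  # []
```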

With all of the terms in one flat sequence, we can count their frequency using Counter from Python's built-in collections module, which produces a dictionary-like mapping of {term: frequency} key-value pairs. We then sort this mapping by frequency, from highest to lowest.

In [8]:
from collections import Counter
entity_freq = dict(Counter(entities))
entity_freq = dict(sorted(entity_freq.items(),
                          key=lambda item: item[1], reverse=True))
print(entity_freq)
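
As a small aside, Counter also offers a most_common() method that returns (term, count) pairs already sorted from most to least frequent, which can be handy for inspecting only the top of the distribution. The sketch below uses a made-up list of terms rather than the real data.

```python
from collections import Counter

# Toy stand-in for the flattened list of extracted terms
terms = ["patients", "asthma", "patients", "ige", "asthma", "patients"]

counts = Counter(terms)

# The two most frequent terms, highest count first
top_two = counts.most_common(2)

print(top_two)  # [('patients', 3), ('asthma', 2)]
```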



We can see that patients, treatment, study, and groups had the highest frequency. These generic terms found in the research literature will not provide any value for our knowledge graph, as they would connect many unrelated abstracts.

However, the lower frequency terms are much more likely to be useful to us.

At this point, we need to select a cutoff point for what high-frequency terms we will allow in our
knowledge graph. There is too much data to do this term by term, so choosing a threshold, or a
method for threshold selection, is likely to be arbitrary. For the purposes of this chapter, we will
remove any terms with a frequency of above 100, recognizing that this will remove some useful
terms, and preserve some lower-frequency generic terms. When designing processing for a real
knowledge graph, this frequency cutoff is something that might be modified in conjunction
with some downstream analysis to examine the effect on the resulting graph and identify an
optimum threshold. Let’s see how many terms we will remove by setting an upper-frequency
threshold of 100.

In [9]:
high_freq = {ent: value for ent, value in entity_freq.items() if value > 100}
print(len(high_freq))
print(len(entity_freq))

199
47667


Our first print statement shows that we will be removing 199 terms from our graph with this
upper threshold. Our second print statement shows that the total number of unique terms
extracted, before removing any, is 47,667, so our upper threshold of 100 results in removing
around 0.4% of terms.
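
One way to explore alternative cutoffs is to compute how many unique terms each candidate threshold would remove. The sketch below does this on a small, made-up frequency dictionary; removed_share is a hypothetical helper written for this example, and the final line simply reproduces the percentage quoted above from the two printed counts.

```python
def removed_share(freqs, upper):
    """Return (count, percent) of unique terms with a frequency above `upper`."""
    removed = sum(1 for value in freqs.values() if value > upper)
    return removed, 100 * removed / len(freqs)


# Toy frequencies, not taken from the real dataset
toy_freq = {"patients": 500, "treatment": 250, "asthma": 90,
            "ige": 12, "voriconazole": 5}

# Compare a few candidate upper thresholds
sweep = {upper: removed_share(toy_freq, upper) for upper in (50, 100, 300)}
print(sweep)  # {50: (3, 60.0), 100: (2, 40.0), 300: (1, 20.0)}

# The share reported in the text: 199 of 47,667 unique terms
chapter_percent = round(100 * 199 / 47667, 1)
print(chapter_percent)  # 0.4
```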

Before removing our high-frequency entities and taking the remaining terms forward to knowledge graph
construction, we must also consider very low-frequency terms. Terms only found in one abstract
won’t connect abstracts and therefore have limited use in a knowledge graph intended to create
relationships between related documents.

We can examine how many terms occur only once in our entire set of abstracts with a similar
dictionary comprehension to the previous one used for high-frequency terms. Here, we
replace the value > 100 conditional with value == 1 to get a dictionary of terms with
a frequency of 1:

In [10]:
low_freq = {ent: value for ent, value in entity_freq.items() if value == 1}
print(len(low_freq))

29364


This shows that there are 29,364 terms that occur only once. This is orders of magnitude larger than
the number of highly common terms, which is fairly typical for NLP pipelines. Looking back
at our total number, we can calculate that terms with a frequency of 1 represent around 61.6%
of unique extracted terms. Due to the large number of low-frequency terms, choosing not to
include them in our downstream graph construction will reduce the size of the graph and improve performance
during knowledge graph analysis later.
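
Note that the counts used here are total occurrences across the corpus. A term that appears several times, but only within a single abstract, would also fail to connect documents. As a possible variation, which is not what this chapter does, the sketch below counts the number of distinct abstracts each term appears in, using made-up data in the same [id, terms] layout as abstract_entities.

```python
from collections import Counter

# Toy data in the same [abstract_id, terms] layout as abstract_entities
toy_abstract_entities = [
    ["0", ["ige", "ige", "asthma"]],
    ["1", ["asthma", "placebo"]],
    ["2", ["placebo", "asthma"]],
]

# Count abstracts per term: set() ensures each abstract counts a term once
doc_freq = Counter()
for _abstract_id, terms in toy_abstract_entities:
    doc_freq.update(set(terms))

# Terms confined to one abstract cannot link two abstracts together
single_abstract_terms = sorted(t for t, n in doc_freq.items() if n == 1)

print(single_abstract_terms)  # ['ige']
```

Here ige occurs twice in total, so a total-frequency filter of value == 1 would keep it, even though it appears in only one abstract.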

We can implement the thresholds we have selected by first creating a list of terms we do not want to keep. Here, we use a list comprehension with conditions set to select the entity
strings of those that occur either more than 100 times or only once:

In [11]:
removed_terms = [ent for ent, value in entity_freq.items() if value > 100 or value == 1]

With our list of terms to exclude, we can now trim down the entities associated with each abstract
in our abstract_entities variable. Let’s use a list comprehension again to select terms
that are not in the removed_terms list while retaining their relationship to each individual
abstract. Then, we can print the first abstract’s newly trimmed-down entities to confirm our
method is working as expected:

In [12]:
abstract_entities = [[row[0], [ent for ent in row[1] if ent not in removed_terms]] for row in abstract_entities]
print(abstract_entities[0])

['0', ['ige', 'sensitization', 'voriconazole', 'asthma', 'ige', 'sensitized', 'asthmatic', 'ige', 'sensitized', 'history', 'exacerbations', 'voriconazole', 'observation', 'placebo-controlled', 'randomized design', 'primary outcomes', 'exacerbations', 'started', 'voriconazole', 'intention-to-treat analysis', 'medication', 'voriconazole', 'placebo groups', 'exacerbations', 'asthma', 'secondary outcome', 'voriconazole', 'ige', 'sensitized', 'exacerbations', 'markers', 'asthma control']]


Comparing the printed entities to those printed at the start of this section, we can see that some
of the very common terms have been removed, as well as some highly specific terms unlikely
to be in another abstract.
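
With tens of thousands of excluded terms, membership tests against a list can become slow, because each `not in` check scans the list. A minimal sketch of the same filtering step using a set, which offers constant-time lookups on average, is shown below. The frequencies and entities are toy values invented for illustration.

```python
# Toy frequencies standing in for entity_freq
toy_freq = {"patients": 150, "asthma": 40, "ige": 7, "a fumigatus": 1}

# Build the exclusion collection as a set instead of a list
removed = {ent for ent, value in toy_freq.items() if value > 100 or value == 1}

# Toy stand-in for abstract_entities
toy_abstract_entities = [["0", ["ige", "patients", "asthma", "a fumigatus", "ige"]]]

# Same filtering logic as before, but each lookup is a fast set membership test
filtered = [[row[0], [ent for ent in row[1] if ent not in removed]]
            for row in toy_abstract_entities]

print(filtered)  # [['0', ['ige', 'asthma', 'ige']]]
```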

While the methods we have used for text processing here are fairly simple, NLP pipelines can also be
highly complex and sophisticated. Designing preprocessing workflows for text can be a bit of an art
and generally benefits from extensive knowledge of the subject and analysis of the text data before
any implementation. We could do more to improve the preprocessing of our abstracts and terms, but
in the interest of focusing on graph data modeling, the next step will be adding our data to a graph
as nodes and edges.

### Constructing the Knowledge Graph

Now that our data is cleaned and we have abstracts associated with our terms, we are ready to begin constructing a knowledge graph. 

When we initially processed our raw abstract data earlier in the chapter, we created an increasing integer ID for each sequential abstract. These can now be used as igraph IDs in node creation.

Let's create node IDs for each term. We will need a list of unique terms, which we can build from the abstract_entities variable, using the second element of each sublist.

In [18]:
terms = [abstract[1] for abstract in abstract_entities]
unique_terms = list(set(itertools.chain.from_iterable(terms)))

Now, we need to assign each unique term an integer ID. These IDs must start from the last ID we assigned to an abstract, plus 1, so that each node has a unique igraph ID.

In [22]:
terms_ids = {term: i for i, term in enumerate(unique_terms, len(data))}
print(terms_ids['ige'])

5042
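
Because the unique terms come from a set, and the iteration order of a set of strings can differ between Python runs, the specific ID assigned to a term may change from one run to the next. If reproducible IDs matter, one option is to sort the terms before enumerating them. The sketch below illustrates this on toy values; toy_abstract_count stands in for len(data) and is an assumption made for the example.

```python
# Stands in for len(data): abstracts already occupy IDs 0, 1, and 2
toy_abstract_count = 3

# Toy set of unique terms
toy_terms = {"ige", "asthma", "voriconazole"}

# Sorting fixes the order, so the same term always receives the same ID
toy_term_ids = {term: i
                for i, term in enumerate(sorted(toy_terms), start=toy_abstract_count)}

print(toy_term_ids)  # {'asthma': 3, 'ige': 4, 'voriconazole': 5}
```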


Now that we have all the node information we need for our graph, we need to construct an edgelist to represent the relationships between terms and abstracts. All the information we need to do this is already held in the abstract_entities variable, but we need to convert abstracts and terms into integer IDs ready for import into an igraph graph.

In our schema, we added an edge attribute to our FOUND_IN relationships. This is frequency, which represents the number of times a term is used in a single abstract and acts as a sort of weighting for how relevant a term is to its source.

To find the frequency of each term in an abstract, we can again use collections.Counter. This creates a dictionary of {term: frequency} pairs that we can use to add a weight to each edge.

Then, for each abstract, we loop through each term in the sublist associated with the current abstract and assemble an edge from integer IDs. This inner loop must be nested inside the loop over abstracts. If it runs after the outer loop has finished, only the terms of the last abstract are added to the edgelist.

In [24]:
edgelist = []
for abstract_id, terms in abstract_entities:
    # Count how often each term occurs in the current abstract
    term_freq = dict(Counter(terms))
    # Add one weighted edge per (term, abstract) pair
    for term, freq in term_freq.items():
        edgelist.append([int(terms_ids[term]), int(abstract_id), freq])

print(edgelist[:10])

Let's confirm that an edge we know should exist is present in the edgelist. In our first abstract, the term ige is used four separate times, so the weight of that edge should be equal to 4. We can use an assert to make sure an edge exists from the ID of ige to the node with ID 0, representing our first abstract, and to check its weight. If the assertion holds, the cell produces no output.

In [33]:
assert [terms_ids['ige'], 0, 4] in edgelist
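
The edge construction can be checked end to end on a small example. The sketch below is self-contained and uses made-up abstracts and term IDs, with the loop over terms nested inside the loop over abstracts so that every abstract contributes its edges. The toy_ names are hypothetical stand-ins for the chapter's variables.

```python
from collections import Counter

# Toy data in the same [abstract_id, terms] layout as abstract_entities
toy_abstract_entities = [
    ["0", ["ige", "asthma", "ige"]],
    ["1", ["asthma"]],
]

# Toy term IDs, starting after the last abstract ID
toy_term_ids = {"asthma": 2, "ige": 3}

toy_edgelist = []
for abstract_id, terms in toy_abstract_entities:
    # Inner loop runs once per abstract, so no abstract is skipped
    for term, freq in Counter(terms).items():
        toy_edgelist.append([toy_term_ids[term], int(abstract_id), freq])

# ige occurs twice in abstract 0, so the edge weight should be 2
assert [toy_term_ids["ige"], 0, 2] in toy_edgelist

print(toy_edgelist)  # [[3, 0, 2], [2, 0, 1], [2, 1, 1]]
```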