#### Linking keywords and named entities
This notebook links the keywords automatically extracted from each abstract's hypothesis (done in a separate notebook) to the named entities from the CORD19 NEKG dataset, using the skos:related predicate.

In [1]:
from rdflib import Graph, Literal, Namespace, RDF, URIRef

In [2]:
# load the keyword instances graph and the bnode graph, because this is the data we want to link (we focus on these two for now and will attempt MeSH later)
kw = Graph()
ne = Graph()
kw.parse("../Parsing/outputs/hypothesis-keywords-graph.ttl", format="turtle")
ne.parse("../Parsing/outputs/bnode_graph.ttl", format="turtle")

<Graph identifier=N8a17b76ded5b4dda9d64a012f39d3613 (<class 'rdflib.graph.Graph'>)>

In [3]:
# get the namespaces for running the sparql queries in python
def create_namespace(graph, namespace, prefix):
    """Bind `prefix` to `namespace` on `graph` and return the rdflib Namespace."""
    ns = Namespace(namespace)
    graph.namespace_manager.bind(prefix, namespace)
    return ns


hyp_namespace = create_namespace(ne, "http://example.org/hypothesis_ontology/", 'hyp')
oa_namespace = create_namespace(ne, "http://www.w3.org/ns/oa#", 'oa')

hyp_namespace = create_namespace(kw, "http://example.org/hypothesis_ontology/", 'hyp')
oa_namespace = create_namespace(kw, "http://www.w3.org/ns/oa#", 'oa')

In [4]:
def get_abstracts(graph):
    """Collect the object of every oa:hasSource triple in `graph`."""
    abstracts = []
    for s, p, o in graph.triples( (None, oa_namespace.hasSource, None) ):
        abstracts.append(o)
    return abstracts

# get_abstracts(kw)

#### Execute sparql queries to obtain desired info from respective graphs

In [5]:
ne_q = ne.query(
"""
prefix dct: <http://purl.org/dc/terms/> 
prefix hyp: <http://example.org/hypothesis_ontology/> 
prefix oa: <http://www.w3.org/ns/oa#>
SELECT ?ne ?literal ?abstract
WHERE {
  ?ne oa:hasSource ?abstract; oa:exact ?literal.
}
"""
)

ne_data = []
for i in ne_q:
    ne_dict = dict()
    ne_dict['ne_id'] = i.ne
    ne_dict['literal'] = i.literal
    ne_dict['abstract'] = i.abstract
    ne_data.append(ne_dict)

In [6]:
kw_q = kw.query(
"""
prefix dct: <http://purl.org/dc/terms/> 
prefix hyp: <http://example.org/hypothesis_ontology/> 
prefix oa: <http://www.w3.org/ns/oa#>
SELECT ?kw ?literal ?abstract
WHERE {
  ?hyp oa:hasSource ?abstract; hyp:contains ?kw.
  ?kw oa:hasTarget ?literal.
}
"""
)

kw_data = []
for i in kw_q:
    kw_dict = dict()
    kw_dict['kw_id'] = i.kw
    kw_dict['literal'] = i.literal
    kw_dict['abstract'] = i.abstract
    kw_data.append(kw_dict)

In [7]:
ne_data[11]['literal']

rdflib.term.Literal('IVIG')

### Filtering duplicates

Here I remove duplicates from both ne_data and kw_data. For every pair (ne, ne2) in ne_data, I check whether the literals and abstracts are the same while the ids differ. If so, the second entry is removed because it is redundant. The same is done for kw_data.

In [8]:
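# Caveat: ne_data is modified while it is being iterated over, so some entries can be skipped;
# the loop variable also rebinds `ne`, which held the named-entity Graph until now.
for ne in ne_data:
    for ne2 in ne_data:
        if ne['abstract'] == ne2['abstract'] and ne['literal'] == ne2['literal'] and ne['ne_id'] != ne2['ne_id']:
            ne_data.remove(ne2)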
            
print(len(ne_data))

8061


In [9]:
for kw in kw_data:
    for kw2 in kw_data:
        if kw['abstract'] == kw2['abstract'] and kw['literal'] == kw2['literal'] and kw['kw_id'] != kw2['kw_id']:
            kw_data.remove(kw2)
            
print(len(kw_data))

7290
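
The pairwise loops above compare every entry with every other entry and remove items from the list being iterated over. As a minimal sketch of an alternative, assuming the intent is to keep the first entry for each (abstract, literal) combination, the same filtering can be done in one pass with a set of seen keys. `dedupe_by_key` is a hypothetical helper, not part of the notebook, and the toy records below are made up for illustration. It compares the string form of each value, so counts on the real data may differ from those printed above.

```python
def dedupe_by_key(records, key_fields=("abstract", "literal")):
    """Keep the first record for each combination of key_fields (hypothetical helper)."""
    seen = set()
    unique = []
    for record in records:
        # Compare on the string form of each field (an assumption of this sketch).
        key = tuple(str(record[field]) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Toy data standing in for ne_data; ids and abstracts are invented.
sample = [
    {"ne_id": "n1", "literal": "IVIG", "abstract": "a1"},
    {"ne_id": "n2", "literal": "IVIG", "abstract": "a1"},  # same abstract and literal as n1
    {"ne_id": "n3", "literal": "IVIG", "abstract": "a2"},  # different abstract, kept
]
deduped = dedupe_by_key(sample)
print([record["ne_id"] for record in deduped])
```

The input list is left untouched and a new list is returned, so the original data stays available for comparison.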


## Adding skos triples

In every case I add skos:related triples only for keywords and named entities whose literal has length >= 3. This filters out keywords or named entities that represent meaningless nodes (literals such as " ", "-", "", "OC", etc.).
The algorithm works as follows:
1. For every keyword whose literal has length >= 3,
2. and for every named entity whose literal has length >= 3,
3. check whether skos:related applies to the pair (keyword, named entity).
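
The steps above can be sketched as one function that takes the matching rule as a parameter. This is a minimal illustration on plain dictionaries, not the notebook's own code: `link_pairs`, `exact` and `contains` are hypothetical names, and the toy records are invented. It returns (keyword id, named entity id) pairs, and each pair would then become one skos:related triple.

```python
def link_pairs(kw_data, ne_data, matches, same_abstract=True, min_len=3):
    """Return (kw_id, ne_id) pairs for which `matches` holds (hypothetical helper)."""
    pairs = []
    for kw_rec in kw_data:
        kw_text = str(kw_rec["literal"])
        if len(kw_text) < min_len:      # step 1: skip short keyword literals
            continue
        for ne_rec in ne_data:
            ne_text = str(ne_rec["literal"])
            if len(ne_text) < min_len:  # step 2: skip short named-entity literals
                continue
            if same_abstract and kw_rec["abstract"] != ne_rec["abstract"]:
                continue
            if matches(kw_text, ne_text):  # step 3: apply the matching rule
                pairs.append((kw_rec["kw_id"], ne_rec["ne_id"]))
    return pairs


def exact(a, b):
    return a == b


def contains(a, b):
    return a in b or b in a


# Toy records; ids, literals and abstracts are made up.
kws = [
    {"kw_id": "k1", "literal": "virus", "abstract": "a1"},
    {"kw_id": "k2", "literal": "OC", "abstract": "a1"},  # too short, ignored
]
nes = [
    {"ne_id": "n1", "literal": "virus", "abstract": "a1"},
    {"ne_id": "n2", "literal": "coronavirus", "abstract": "a2"},
]
exact_same_abstract = link_pairs(kws, nes, exact)
contain_any_abstract = link_pairs(kws, nes, contains, same_abstract=False)
print(exact_same_abstract)
print(contain_any_abstract)
```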

### Adding skos:related when the literals are exactly the same

Here I assume that the keyword and the named entity HAVE TO BE from the same abstract. When they are, a skos:related triple is added only if their literals are exactly the same.

In [22]:
g = Graph()
skos_namespace = create_namespace(g, "http://www.w3.org/2004/02/skos/core#", 'skos')
for kw in kw_data:
    if len(str(kw['literal'])) < 3:
        continue
    for ne in ne_data:
        if len(str(ne['literal'])) < 3:
            continue
        if kw['abstract'] != ne['abstract']:
            continue
        if str(kw['literal']) == str(ne['literal']):
            g.add((kw['kw_id'],skos_namespace.related, ne['ne_id']))
            
g.serialize('./skos_same_abstract_equal.ttl', format="turtle")
print(len(g))

407


Here I assume that the keyword and the named entity DO NOT HAVE TO BE from the same abstract. A skos:related triple is added only if their literals are exactly the same.

In [23]:
g = Graph()
skos_namespace = create_namespace(g, "http://www.w3.org/2004/02/skos/core#", 'skos')
for kw in kw_data:
    if len(str(kw['literal'])) < 3:
        continue
    for ne in ne_data:
        if len(str(ne['literal'])) < 3:
            continue
        if str(kw['literal']) == str(ne['literal']):
            g.add((kw['kw_id'],skos_namespace.related, ne['ne_id']))
            
g.serialize('./skos_equal.ttl', format="turtle")
print(len(g))

8795
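
For exact matches, the nested loop compares every keyword with every named entity. A possible shortcut, sketched here under the assumption that string equality of the literals is the only criterion, is to index the named entities by literal first and look each keyword up in that index. `exact_links` is a hypothetical helper and the records are toy data, not notebook output.

```python
from collections import defaultdict


def exact_links(kw_data, ne_data, min_len=3):
    """Pair keywords and named entities whose literals are identical strings."""
    # Index named-entity ids by literal text, skipping short literals.
    ne_by_literal = defaultdict(list)
    for ne_rec in ne_data:
        text = str(ne_rec["literal"])
        if len(text) >= min_len:
            ne_by_literal[text].append(ne_rec["ne_id"])

    links = []
    for kw_rec in kw_data:
        text = str(kw_rec["literal"])
        if len(text) < min_len:
            continue
        # One dictionary lookup replaces the inner loop over all named entities.
        for ne_id in ne_by_literal.get(text, []):
            links.append((kw_rec["kw_id"], ne_id))
    return links


# Toy records; ids and abstracts are invented.
kws = [
    {"kw_id": "k1", "literal": "IVIG", "abstract": "a1"},
    {"kw_id": "k2", "literal": "fever", "abstract": "a2"},
]
nes = [
    {"ne_id": "n1", "literal": "IVIG", "abstract": "a1"},
    {"ne_id": "n2", "literal": "IVIG", "abstract": "a3"},
    {"ne_id": "n3", "literal": "cough", "abstract": "a2"},
]
links = exact_links(kws, nes)
print(links)
```

To restrict matches to the same abstract, the index key could be the pair (abstract, literal) instead of the literal alone.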


### Adding skos:related when one literal is contained in the other

Here I assume that the keyword and the named entity HAVE TO BE from the same abstract. When they are, a skos:related triple is added only if the keyword literal is contained in the named entity literal, or the named entity literal is contained in the keyword literal.

In [10]:
g = Graph()
skos_namespace = create_namespace(g, "http://www.w3.org/2004/02/skos/core#", 'skos')
for kw in kw_data:
    if len(str(kw['literal'])) < 3:
        continue
    for ne in ne_data:
        if len(str(ne['literal'])) < 3:
            continue
        if kw['abstract'] != ne['abstract']:
            continue
        # link the pair if either literal is a substring of the other
        if str(kw['literal']) in str(ne['literal']) or str(ne['literal']) in str(kw['literal']):
            g.add((kw['kw_id'],skos_namespace.related, ne['ne_id']))
            
g.serialize('./skos_same_abstract_contain.ttl', format="turtle")
print(len(g))

1738


Here I assume that the keyword and the named entity DO NOT HAVE TO BE from the same abstract. A skos:related triple is added only if the keyword literal is contained in the named entity literal, or the named entity literal is contained in the keyword literal.

In [25]:
g = Graph()
skos_namespace = create_namespace(g, "http://www.w3.org/2004/02/skos/core#", 'skos')
for kw in kw_data:
    if len(str(kw['literal'])) < 3:
        continue
    for ne in ne_data:
        if len(str(ne['literal'])) < 3:
            continue
        # link the pair if either literal is a substring of the other
        if str(kw['literal']) in str(ne['literal']) or str(ne['literal']) in str(kw['literal']):
            g.add((kw['kw_id'],skos_namespace.related, ne['ne_id']))
            
g.serialize('./skos_contain.ttl', format="turtle")
print(len(g))

121905
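
Plain substring containment also matches inside words (for example, "cov" is a substring of "recovery"), which may account for part of the large triple count when abstracts are not required to match. Whether such matches are wanted is not stated here. If they are not, a stricter variant could require the shorter literal to appear as whole words. The sketch below is only an illustration of that idea: `contains_whole_words` and `related_by_words` are hypothetical helpers, and the example strings are invented.

```python
import re


def contains_whole_words(shorter, longer):
    """True if `shorter` occurs in `longer` without word characters on either side."""
    pattern = r"(?<!\w)" + re.escape(shorter) + r"(?!\w)"
    return re.search(pattern, longer) is not None


def related_by_words(a, b):
    """Symmetric containment check, mirroring the two-way substring test."""
    return contains_whole_words(a, b) or contains_whole_words(b, a)


whole_word = related_by_words("virus", "corona virus")
inside_word = related_by_words("cov", "recovery")
hyphenated = related_by_words("SARS-CoV", "SARS-CoV-2 infection")
print(whole_word, inside_word, hyphenated)
```

Like the substring test, this check is case-sensitive. Lowercasing both literals first would be a separate design choice.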
