<a href="https://colab.research.google.com/github/elisasmenendez/inforank/blob/master/inforank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Context

A key contributor to the success of keyword search systems is a ranking mechanism that considers the importance of the retrieved documents. The notion of importance in graphs is typically computed using centrality measures that highly depend on the degree of the nodes, such as PageRank. However, in RDF graphs, the notion of importance is not necessarily related to the node degree.

To solve this problem, we propose a novel family of importance measures for RDF graphs, collectively called InfoRank, that combine three intuitions: (I) “important things have lots of information about them”; (II) “important things are surrounded by other important things”; (III) “few important relations (e.g. friends) are better than many unimportant relations (e.g. acquaintances)”.

In this implementation, we show how to compute InfoRank and PageRank using SPARQL queries since RDF graphs are typically stored in triplestores. We use RDFLib to simulate this environment.

# Running Example

To illustrate the implementation steps, consider the simple graph shown in the following image, with IRIs drawn as ovals and literals drawn as dashed boxes (the image does not render inline in Colab, so it is linked below).

https://drive.google.com/uc?export=view&id=1NHAl72vWPRlFOiQet9bF5--VjxsQStIZ

Consider that the graph also includes the following triples: 
 (X, rdf:type, rdfs:Class), 
 (Y, rdf:type, rdfs:Class), 
 (Z, rdf:type, rdfs:Class), 
 (x1, rdf:type, X), 
 (x2, rdf:type, X), 
 (y1, rdf:type, Y), 
 (y2, rdf:type, Y),
 (z1, rdf:type, Z), 
 (z2, rdf:type, Z), … (z10, rdf:type, Z)

Let's see how to install RDFLib and load this data.


In [None]:
pip install rdflib

In [None]:
import rdflib

g = rdflib.Graph()
g.parse("https://raw.githubusercontent.com/elisasmenendez/inforank/master/example.ttl", format="turtle")

g.bind("ex", "http://example.com/")
g.bind("quira", "http://quira.com/")

Here is how we print all triples.

In [None]:
qres = g.query("select ?s ?p ?o where { ?s ?p ?o }")

for row in qres:
  print ("%s %s %s" % row)

# Instance Informativeness

As the first step toward the InfoRank score, we compute the absolute informativeness of instances by counting their literals. Note that, since our graph contains schema information, we apply a graph pattern that restricts variable ?r to instances of some class. We also assume that there are no blank nodes. Furthermore, since some instances may have no literals, the pattern that matches literals is wrapped in OPTIONAL, so an instance with no matching literal gets a count of 0.
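
The counting rule can also be sketched in plain Python, independent of SPARQL. This is only an illustration: the triples, the `ex:` names, and the helper `informativeness` are hypothetical and are not taken from example.ttl.

```python
# Toy triples as (subject, predicate, object, object_is_literal).
# All names and values here are hypothetical, for illustration only.
triples = [
    ("ex:C", "rdf:type", "rdfs:Class", False),
    ("ex:C", "rdfs:label", "C", True),
    ("ex:a", "rdf:type", "ex:C", False),
    ("ex:a", "ex:name", "Alice", True),
    ("ex:a", "ex:age", "30", True),
    ("ex:b", "rdf:type", "ex:C", False),
]


def informativeness(triples):
    """Count literal-valued triples per instance of some class."""
    # Resources explicitly typed as rdfs:Class.
    classes = {s for s, p, o, _ in triples
               if p == "rdf:type" and o == "rdfs:Class"}
    # Instances are resources typed with one of those classes.
    instances = {s for s, p, o, _ in triples
                 if p == "rdf:type" and o in classes}
    # Start every instance at 0 so instances without literals are kept.
    counts = {r: 0 for r in instances}
    for s, _, _, is_literal in triples:
        if is_literal and s in counts:
            counts[s] += 1
    return counts


infoness = informativeness(triples)
```

Starting every instance at 0 plays the role of the OPTIONAL block: `ex:b` has no literals but still receives a value, while the class `ex:C` is excluded because it is not an instance of a class.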

In [None]:
# Computing the informativeness of instances

g.update("""
insert { ?r quira:infoness ?infoness }
where { 
  select ?r (count(?o) as ?infoness)
  where {
    ?r rdf:type/rdf:type rdfs:Class .
    OPTIONAL {
      ?r ?p ?o . 
      filter ( isLiteral(?o) )  
    }
  }
  group by ?r 
}
""")

Let's check the results

In [None]:
qres = g.query("""
select ?s ?o
where { ?s quira:infoness ?o }  
order by ?s  
""")

for row in qres:
  print ("%s %s" % row)

# InfoRank - Ranking Schema Data

Now, we compute the InfoRank score of classes from the informativeness of their instances, to capture the idea that “important classes usually have informative instances”.
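
As a rough plain-Python sketch of this step, the score of a class can be taken as the maximum informativeness among its instances, which is the aggregate the SPARQL update in this section uses. The helper `class_inforank` and the informativeness values are hypothetical, and the sketch assumes each instance has a single class.

```python
# Hypothetical inputs: instance -> class, and instance -> informativeness.
instance_class = {"x1": "X", "x2": "X", "y1": "Y", "y2": "Y"}
infoness = {"x1": 1, "x2": 3, "y1": 5, "y2": 2}


def class_inforank(instance_class, infoness):
    """Score each class by the highest informativeness of its instances."""
    rank = {}
    for instance, cls in instance_class.items():
        rank[cls] = max(rank.get(cls, 0), infoness[instance])
    return rank


class_scores = class_inforank(instance_class, infoness)
```

Using the maximum means a single highly informative instance is enough to make its class rank high, regardless of how many instances the class has.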

In [None]:
# Computing the InfoRank of classes
g.update("""
insert { ?c quira:inforank ?inforank }
where { 
  select ?c (max(?infoness) as ?inforank)
  where { 
    ?c rdf:type rdfs:Class .
    ?r rdf:type ?c .
    ?r quira:infoness ?infoness .
  }
  group by ?c
}
""")

And we can rank the classes in descending order of their InfoRank score.

In [None]:
qres = g.query("""
select ?c ?inforank
where { 
  ?c rdf:type rdfs:Class . 
  ?c quira:inforank ?inforank 
}  
order by desc(?inforank)
""")

for row in qres:
  print ("%s %s" % row)

Likewise, we compute the InfoRank score of object properties based on the informativeness of instances, to capture the idea that “important properties are usually those connecting informative instances”. 
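
The same idea can be sketched in plain Python: for every edge, add the informativeness of its two endpoints, and keep the largest sum seen for each property. The helper `property_inforank`, the edges, and the values are hypothetical and only mirror the aggregate used by the SPARQL update in this section.

```python
# Hypothetical edges (subject, property, object) between instances,
# and hypothetical informativeness values for those instances.
edges = [("a", "p", "c"), ("b", "p", "c"), ("a", "q", "b")]
infoness = {"a": 1, "b": 3, "c": 4}


def property_inforank(edges, infoness):
    """Score each property by the most informative pair it connects."""
    rank = {}
    for subj, prop, obj in edges:
        pair_info = infoness[subj] + infoness[obj]
        rank[prop] = max(rank.get(prop, 0), pair_info)
    return rank


property_scores = property_inforank(edges, infoness)
```

Here `p` connects the pairs (a, c) and (b, c) with sums 5 and 7, so it scores 7, while `q` connects only (a, b) and scores 4.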

In [None]:
# Computing the InfoRank of object properties
g.update("""
insert { ?p quira:inforank ?inforank }
where { 
  select ?p (max(?info_r + ?info_s) as ?inforank)
  where { 
    ?r ?p ?s .
    ?r rdf:type/rdf:type rdfs:Class .
    ?s rdf:type/rdf:type rdfs:Class .
    ?r quira:infoness ?info_r .
    ?s quira:infoness ?info_s .
  }
  group by ?p
}
""")

And we can rank the object properties in descending order of their InfoRank score.

In [None]:
qres = g.query("""
select distinct ?p ?inforank
where { 
  ?s ?p ?o .
  ?p quira:inforank ?inforank .
}  
order by desc(?inforank)
""")

for row in qres:
  print ("%s %s" % row)

# InfoRank - Ranking Data

Note that we used only Intuition I in our strategies to rank metadata resources. However, we propose a combination of the three intuitions to rank the data itself, that is, the instances. 

To do that, we first execute a weighted version of PageRank using the InfoRank score of properties as the edge weight. To normalize these weights, we first compute an auxiliary property named sumInfo: for each resource, the sum of the InfoRank scores of the properties on all of its edges, in either direction.

In [None]:
# Computing sumInfo
g.update("""
insert { ?r quira:sumInfo ?sumInfo }
where {
  select ?r (sum(?info) as ?sumInfo)
  where { 
    { select *
      where {
        ?r ?q ?t .
        ?q quira:inforank ?info .
      }
    }
    UNION
    { select *
      where {
        ?t ?q ?r .
        ?q quira:inforank ?info .
      }
    }
  }
  group by ?r
}
""")

Now, we initialize the PageRank scores with *1/n*, in which *n* is the number of instances in the graph.

In [None]:
# Counting the number of instances in the graph
qres = g.query("""
select (count(*) as ?n)
where { ?r rdf:type/rdf:type rdfs:Class }
""")
n = qres.bindings[0]['n']
last = 'score1'
curr = 'score2'

# Initializing all scores with 1/n
g.update("""
insert { ?r quira:%s ?score }
where { 
  ?r rdf:type/rdf:type rdfs:Class .
  BIND( (1/%s) as ?score )
}
""" % (last,n) )

Finally, we execute the Power Iteration Method to compute PageRank.
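
For reference, the update applied to each instance r in every iteration can be read as score(r) = (1 - d)/n + d * Σ score(s) * w(r, s) / sumInfo(s), summed over the neighbours s of r, where d is the damping factor and w(r, s) is the weight of the edge between them. The following plain-Python version is a minimal sketch of that rule, not the SPARQL implementation itself: the function `weighted_pagerank` and its parameters are assumptions made for illustration, it assumes every node has at least one edge, and the edge list is the one used in the NetworkX comparison later in this notebook.

```python
def weighted_pagerank(edges, d=0.85, tol=1.0e-4, max_iter=100):
    """Power iteration over an undirected, weighted edge list."""
    # Build an adjacency list, treating every edge as bidirectional.
    neighbors = {}
    for u, v, w in edges:
        neighbors.setdefault(u, []).append((v, w))
        neighbors.setdefault(v, []).append((u, w))

    n = len(neighbors)
    # sumInfo: total weight of the edges touching each node.
    sum_info = {r: sum(w for _, w in nbrs) for r, nbrs in neighbors.items()}
    # Start from the uniform distribution 1/n.
    score = {r: 1.0 / n for r in neighbors}

    for _ in range(max_iter):
        new_score = {
            r: (1 - d) / n
            + d * sum(score[s] * w / sum_info[s] for s, w in nbrs)
            for r, nbrs in neighbors.items()
        }
        # Convergence measure: sum of absolute score changes.
        conv = sum(abs(new_score[r] - score[r]) for r in neighbors)
        score = new_score
        if conv < tol:
            break
    return score


edges = [("X1", "Y1", 8), ("X2", "Y1", 8), ("Y2", "Y1", 7)]
edges += [("Z%d" % i, "Y2", 2) for i in range(1, 11)]
scores = weighted_pagerank(edges)
```

Because each node distributes its score in proportions that sum to 1, the scores keep summing to 1 across iterations.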

In [None]:
# This query simulates an iteration with a damping factor of 0.85  
queryIter = """
insert { ?r quira:%s ?score }
where {
  select ?r (( ((1-0.85)/%s) + 
               (0.85 * sum(?score * (?infoP/?sumInfo) )) ) as ?score)
  where { 
    { select *
      where {
        ?r rdf:type/rdf:type rdfs:Class .
        ?s ?p ?r .
        ?p quira:inforank ?info .
        ?s quira:%s ?score .
        ?p quira:inforank ?infoP .
        ?s quira:sumInfo ?sumInfo .
      }
    }
    UNION
    { select *
      where {
        ?r rdf:type/rdf:type rdfs:Class .
        ?r ?p ?s .
        ?p quira:inforank ?info .
        ?s quira:%s ?score .
        ?p quira:inforank ?infoP .
        ?s quira:sumInfo ?sumInfo .
      }
    } 
  }
  group by ?r
}
"""

# This query simulates the convergence calculation 
queryConv = """
select (sum(abs(?curr - ?last)) as ?conv)
where { 
  ?r quira:%s ?curr .
  ?r quira:%s ?last . 
}
"""

# Executing the Power Iteration method
tol = 1.0e-4
max_iter = 100  # named max_iter to avoid shadowing the built-in max()
converged = 0

for i in range(max_iter-1):
  
  # Iterating
  g.update(queryIter % (curr, n, last, last))
  
  # Checking convergence
  qres = g.query(queryConv % (curr, last))
  conv = float(qres.bindings[0]['conv'])

  # print("Iteration %s - Convergence %s" % (i, conv))  

  if conv < tol:
    converged = 1 
    g.update("insert { ?r quira:pagerankW ?score } where { ?r quira:%s ?score }" % curr)
    print("Converged after %s iterations" % (i + 1))
    
    # Let's do some cleaning
    g.update("delete where { ?s quira:%s ?o }" % last)
    g.update("delete where { ?s quira:%s ?o }" % curr)
    g.update("delete where { ?s quira:sumInfo ?o }")
    break

  else:
    g.update("delete where { ?s quira:%s ?o }" % last)
    temp = last
    last = curr
    curr = temp

if converged == 0:
  print("Failed to converge after %s iterations" % (i + 1)) 

Let's check the result. 

In [None]:
qres = g.query("""
select ?r ?score
where {?r quira:pagerankW ?score }  
order by desc(?score)
""")

for row in qres:
  print ("%s %s" % row)

Compare with the result given by NetworkX.

In [None]:
import networkx as nx
import operator

edges = [
  ('X1', 'Y1', 8),
  ('X2', 'Y1', 8),
  ('Y2', 'Y1', 7),
  ('Z1', 'Y2', 2),
  ('Z2', 'Y2', 2),
  ('Z3', 'Y2', 2),
  ('Z4', 'Y2', 2),
  ('Z5', 'Y2', 2),
  ('Z6', 'Y2', 2),
  ('Z7', 'Y2', 2),
  ('Z8', 'Y2', 2),
  ('Z9', 'Y2', 2),
  ('Z10', 'Y2', 2)]

G = nx.Graph()  
G.add_weighted_edges_from(edges)  

res = nx.pagerank(G)

sorted_res = sorted(res.items(), key=operator.itemgetter(1), reverse=True)  
for i,j in sorted_res:
  print(i + " - " + str(j))


Note that, although we use the InfoRank of properties to weight the edges, instance Y2 still gets a higher score than Y1. Hence, as the final step, we combine the weighted PageRank with the informativeness of instances. 
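
The effect of this combination can be seen in a small plain-Python sketch. The scores and informativeness values below are hypothetical and chosen only to show how multiplying by informativeness can reorder a ranking.

```python
# Hypothetical weighted PageRank scores and informativeness values.
pagerank_w = {"a": 0.30, "b": 0.20}
infoness = {"a": 1, "b": 4}

# Final InfoRank of an instance: weighted PageRank times informativeness.
inforank = {r: pagerank_w[r] * infoness[r] for r in pagerank_w}

# Rank instances by descending InfoRank.
ranking = sorted(inforank, key=inforank.get, reverse=True)
```

In this toy case `a` has the higher PageRank, but `b` ends up first because it carries more literals.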

In [None]:
g.update("""
insert { ?r quira:inforank ?info }    
where {
  select ?r ((?score * ?infoness) as ?info)
  where {
    ?r quira:pagerankW ?score .
    ?r quira:infoness ?infoness .
  }
}
""")

And the final result is...

In [None]:
qres = g.query("""
select ?r ?score
where { 
  ?r rdf:type/rdf:type rdfs:Class .
  ?r quira:inforank ?score .
}  
order by desc(?score)
""")

for row in qres:
  print ("%s %s" % row)