# Analysis of the IDP Knowledge Graph

__Authors:__
Alasdair J G Gray ([ORCID:0000-0002-5711-4872](http://orcid.org/0000-0002-5711-4872)), _Heriot-Watt University, Edinburgh, UK_

Petros Papadopoulos ([ORCID:0000-0002-8110-7576](https://orcid.org/0000-0002-8110-7576)), _Heriot-Watt University, Edinburgh, UK_

Ivan Mičetić ([ORCID:0000-0003-1691-8425](https://orcid.org/0000-0003-1691-8425)), _University of Padua, Italy_

Andras Hatos ([ORCID:0000-0001-9224-9820](https://orcid.org/0000-0001-9224-9820)), _University of Padua, Italy_

__License:__ Apache 2.0

__Acknowledgements:__ This notebook was created during the Virtual BioHackathon-Europe 2020.

## Introduction

This notebook contains SPARQL queries for analysing the Intrinsically Disordered Protein (IDP) Knowledge Graph (IDP-KG). The IDP-KG was constructed from Bioschemas markup embedded in DisProt, MobiDB, and the Protein Ensemble Database (PED). The markup was harvested using the Bioschemas Markup Scraper and Extractor and converted into a knowledge graph using the process in this [notebook](https://github.com/elixir-europe/BioHackathon-projects-2020/blob/master/projects/24/IDPCentral/notebooks/ETLProcess.ipynb).

### Library Imports

In [None]:
# Import and configure logging library
from datetime import datetime
import logging
logging.basicConfig(
    filename='idpQuery.log', 
    filemode='w', 
    format='%(levelname)s:%(message)s', 
    level=logging.INFO)
logging.info('Starting processing at %s', datetime.now().time())

In [2]:
# Imports from RDFlib
from rdflib import ConjunctiveGraph

### Result Display Function

The following function takes the results of a SPARQL `SELECT` query and displays them as an HTML table for human viewing.

In [3]:
def displayResults(queryResult):
    """Render the rows of a SPARQL SELECT result as an HTML table."""
    from IPython.display import display, HTML
    HTMLResult = '<table><tr style="color:white;background-color:#43BFC7;font-weight:bold">'
    # Build the header row from the query's variable names
    for varName in queryResult.vars:
        HTMLResult = HTMLResult + '<td>' + str(varName) + '</td>'
    HTMLResult = HTMLResult + '</tr>'

    # Build one table row per result row
    for row in queryResult:
        HTMLResult = HTMLResult + '<tr>'
        for column in row:
            # Unbound variables (e.g. from OPTIONAL) are returned as None
            if column is not None and str(column) != "":
                HTMLResult = HTMLResult + '<td>' + str(column) + '</td>'
            else:
                HTMLResult = HTMLResult + '<td>N/A</td>'
        HTMLResult = HTMLResult + '</tr>'
    HTMLResult = HTMLResult + '</table>'
    display(HTML(HTMLResult))
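
Building the table by string concatenation inserts values into the markup verbatim, so a literal containing `<` or `&` could break the rendered table. The following is a minimal sketch of an escaping variant, assuming the column names and rows are passed as plain Python sequences. The names `results_to_html` and `missing` are hypothetical and not part of the notebook.

```python
from html import escape


def results_to_html(var_names, rows, missing="N/A"):
    """Return an HTML table for the given column names and rows (hypothetical helper)."""
    # Header cells, escaped so that special characters cannot break the markup
    header = "".join(f"<th>{escape(str(name))}</th>" for name in var_names)
    body = ""
    for row in rows:
        # None stands for an unbound value and is shown as the placeholder
        cells = "".join(
            f"<td>{missing if value is None else escape(str(value))}</td>"
            for value in row
        )
        body += f"<tr>{cells}</tr>"
    return f"<table><tr>{header}</tr>{body}</table>"


# Made-up rows for illustration only
table_html = results_to_html(
    ["protein", "name"],
    [["https://example.org/P1", None], ["https://example.org/P2", "A<B & C"]],
)
print(table_html)
```

Returning the string rather than displaying it keeps the helper usable outside a notebook; inside one, the result could be passed to `IPython.display.HTML`.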

## Loading IDP-KG

The data is read in from an N-Quads file (`IDPKG.nq`). The data is expected to be organised into multiple named graphs according to the source from which it was extracted, with provenance data in the default graph.

In [4]:
idpKG = ConjunctiveGraph()
idpKG.parse("IDPKG.nq", format="nquads")
logging.info("\tIDP-KG has %s statements.", len(idpKG))

## Knowledge Graph Statistics

This section reports various statistics about the IDP-KG. The choice of statistics was inspired by the [HCLS Dataset Description Community Profile](https://www.w3.org/TR/hcls-dataset/#s6_6).

### Number of Triples

In [5]:
displayResults(idpKG.query("""
SELECT (COUNT(*) AS ?triples) 
WHERE {
  { ?s ?p ?o } 
  UNION 
  { GRAPH ?g 
    {?s ?p ?o  }
  }
}
"""))

| triples |
| --- |
| 460 |


### Number of Typed Entities

Note that we use the `DISTINCT` keyword in the query since the same entity can appear in multiple named graphs.

In [6]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?s) AS ?entities) 
WHERE { 
  { ?s a [] } 
  UNION 
  { GRAPH ?g 
    { ?s a [] }
  }
}
"""))

| entities |
| --- |
| 40 |


### Number of Unique Subjects

In [7]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?s) AS ?subjects) 
WHERE { 
  { ?s ?p ?o } 
  UNION 
  { GRAPH ?g 
    { ?s ?p ?o }
  }
}
"""))

| subjects |
| --- |
| 47 |


### Number of Unique Properties

In [8]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?p) AS ?properties) 
WHERE { 
  { ?s ?p ?o } 
  UNION 
  { GRAPH ?g 
    { ?s ?p ?o }
  }
}
"""))

| properties |
| --- |
| 19 |


### Number of Unique Objects

In [9]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?o) AS ?objects) 
WHERE { 
  { ?s ?p ?o } 
  UNION 
  { GRAPH ?g 
    { ?s ?p ?o }
  }
  FILTER(!isLiteral(?o))
}
"""))

| objects |
| --- |
| 94 |


### Number of Unique Classes

In [10]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?o) AS ?classes) 
WHERE { 
  { ?s a ?o } 
  UNION 
  { GRAPH ?g 
    { ?s a ?o }
  }
}
"""))

| classes |
| --- |
| 4 |


### Number of Unique Literals

In [11]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?o) AS ?literals) 
WHERE { 
  { ?s ?p ?o } 
  UNION 
  { GRAPH ?g 
    { ?s ?p ?o }
  }
  FILTER(isLiteral(?o))
}
"""))

| literals |
| --- |
| 45 |


### Number of Graphs

In [12]:
displayResults(idpKG.query("""
SELECT (COUNT(DISTINCT ?g) AS ?graphs) 
WHERE { 
  GRAPH ?g 
    { ?s ?p ?o }
}
"""))

| graphs |
| --- |
| 8 |


### Instances per Class

In [13]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?Class (COUNT(DISTINCT ?s) AS ?distinctInstances) 
WHERE {
    GRAPH ?g {
        ?s a ?Class
    }
} 
GROUP BY ?Class
ORDER BY ?Class
"""))

| Class | distinctInstances |
| --- | --- |
| https://schema.org/PropertyValue | 8 |
| https://schema.org/Protein | 7 |
| https://schema.org/SequenceAnnotation | 12 |
| https://schema.org/SequenceRange | 13 |


### Properties and their Occurrence

In [14]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?p (COUNT(?p) AS ?triples) 
WHERE {
    { ?s ?p ?o }
    UNION
    {
        GRAPH ?g {
            ?s ?p ?o
        }
    }
} 
GROUP BY ?p
ORDER BY ?p
"""))

| p | triples |
| --- | --- |
| http://purl.org/pav/retrievedFrom | 14 |
| http://purl.org/pav/retrievedOn | 14 |
| http://www.w3.org/1999/02/22-rdf-syntax-ns#type | 81 |
| https://schema.org/additionalProperty | 16 |
| https://schema.org/citation | 10 |
| https://schema.org/creationMethod | 62 |
| https://schema.org/description | 12 |
| https://schema.org/editor | 10 |
| https://schema.org/hasRepresentation | 4 |


### Property, Number of Unique Typed Subjects, and Triples

In [15]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT (COUNT(DISTINCT ?s) AS ?scount) ?stype ?p (COUNT(?p) AS ?triples) 
WHERE {
    { 
        ?s ?p ?o .
        ?s a ?stype
    }
    UNION
    {
        GRAPH ?g {
            ?s ?p ?o .
            ?s a ?stype 
        }
    }
} 
GROUP BY ?p ?stype
ORDER BY ?stype ?p
"""))

| scount | stype | p | triples |
| --- | --- | --- | --- |
| 8 | https://schema.org/PropertyValue | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | 16 |
| 8 | https://schema.org/PropertyValue | https://schema.org/name | 16 |
| 8 | https://schema.org/PropertyValue | https://schema.org/value | 16 |
| 7 | https://schema.org/Protein | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | 15 |
| 2 | https://schema.org/Protein | https://schema.org/hasRepresentation | 4 |
| 7 | https://schema.org/Protein | https://schema.org/hasSequenceAnnotation | 28 |
| 7 | https://schema.org/Protein | https://schema.org/identifier | 20 |
| 3 | https://schema.org/Protein | https://schema.org/name | 7 |
| 7 | https://schema.org/Protein | https://schema.org/sameAs | 47 |


### Number of Unique Typed Objects Linked to a Property

Note that no triple pattern in this query binds `?otype`, so the `otype` column in the results is empty and the counts are effectively grouped by property only.

In [16]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?p (COUNT(?p) AS ?triples) ?otype (COUNT(DISTINCT ?o) AS ?count)
WHERE {
    { ?s ?p ?o }
    UNION
    {
        GRAPH ?g {
            ?s ?p ?o
        }
    }
} 
GROUP BY ?p ?otype
ORDER BY ?p
"""))

| p | triples | otype | count |
| --- | --- | --- | --- |
| http://purl.org/pav/retrievedFrom | 14 | | 7 |
| http://purl.org/pav/retrievedOn | 14 | | 7 |
| http://www.w3.org/1999/02/22-rdf-syntax-ns#type | 81 | | 4 |
| https://schema.org/additionalProperty | 16 | | 8 |
| https://schema.org/citation | 10 | | 2 |
| https://schema.org/creationMethod | 62 | | 21 |
| https://schema.org/description | 12 | | 3 |
| https://schema.org/editor | 10 | | 2 |
| https://schema.org/hasRepresentation | 4 | | 2 |


### Triples and Number of Unique Literals Related to a Property

In [17]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?p (COUNT(?p) AS ?triples) (COUNT(DISTINCT ?o) AS ?literals)
WHERE {
    { ?s ?p ?o }
    UNION
    {
        GRAPH ?g {
            ?s ?p ?o
        }
    }
    FILTER (isLiteral(?o))
} 
GROUP BY ?p
ORDER BY ?p
"""))

| p | triples | literals |
| --- | --- | --- |
| http://purl.org/pav/retrievedOn | 14 | 7 |
| https://schema.org/description | 12 | 3 |
| https://schema.org/hasRepresentation | 4 | 2 |
| https://schema.org/identifier | 20 | 8 |
| https://schema.org/name | 23 | 4 |
| https://schema.org/rangeEnd | 26 | 11 |
| https://schema.org/rangeStart | 26 | 10 |


### Number of Unique Subject Types Linked to Unique Object Types

In [18]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT (COUNT(DISTINCT ?s) AS ?scount) ?stype ?p ?otype (COUNT(DISTINCT ?o) AS ?ocount)
WHERE {
    { 
        ?s ?p ?o .
        ?s a ?stype .
        ?o a ?otype .
    }
    UNION
    {
        GRAPH ?g {
            ?s ?p ?o .
            ?s a ?stype .
            ?o a ?otype .
        }
    }
} 
GROUP BY ?p ?stype ?otype
ORDER BY ?p
"""))

| scount | stype | p | otype | ocount |
| --- | --- | --- | --- | --- |
| 8 | https://schema.org/SequenceAnnotation | https://schema.org/additionalProperty | https://schema.org/PropertyValue | 8 |
| 7 | https://schema.org/Protein | https://schema.org/hasSequenceAnnotation | https://schema.org/SequenceAnnotation | 12 |
| 12 | https://schema.org/SequenceAnnotation | https://schema.org/sequenceLocation | https://schema.org/SequenceRange | 13 |


## Find proteins in multiple datasets

Provenance information is stored in the default graph as annotations on each named graph.

A protein comes from multiple sources if its `schema:Protein` type triple is found in more than one named graph. The number of named graphs containing that triple gives the number of sources that describe the protein.

In [19]:
displayResults(idpKG.query("""
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?protein (COUNT(?g) as ?numSources) (GROUP_CONCAT(?source;SEPARATOR=", ") AS ?sources)
WHERE {
    GRAPH ?g {
        ?protein a schema:Protein .
    }
    ?g pav:retrievedFrom ?source .
}
GROUP BY ?protein
HAVING (COUNT(*) > 1)
"""))

| protein | numSources | sources |
| --- | --- | --- |
| https://bioschemas.org/entity/P03265 | 2 | https://dev.mobidb.org/P03265, https://disprot.org/DP00003 |


## Find proteins with annotations in multiple datasets

We are looking for annotations where the protein is common but the annotation is different across the datasets.

First we'll write a query to find the proteins with annotations and return the provenance of where the annotation has come from.

In [20]:
displayResults(idpKG.query("""
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>
SELECT DISTINCT ?protein ?proteinName ?source1 ?annotation1 ?annotation2 ?source2
WHERE {
    GRAPH ?g1 {
        ?protein a schema:Protein ;
            schema:hasSequenceAnnotation ?annotation1 .
        OPTIONAL {?protein schema:name ?proteinName .}
    }
    ?g1 pav:retrievedFrom ?source1 .
    GRAPH ?g2 {
        ?protein a schema:Protein ;
            schema:hasSequenceAnnotation ?annotation2
    }
    ?g2 pav:retrievedFrom ?source2 .
    FILTER(?g1 != ?g2)
}
"""))

| protein | proteinName | source1 | annotation1 | annotation2 | source2 |
| --- | --- | --- | --- | --- | --- |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://dev.mobidb.org/P03265 | https://mobidb.org/P03265#prediction-disorder-mobidb_lite125_166 | https://disprot.org/DP00003r002 | https://disprot.org/DP00003 |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://dev.mobidb.org/P03265 | https://mobidb.org/P03265#prediction-disorder-mobidb_lite125_166 | https://disprot.org/DP00003r004 | https://disprot.org/DP00003 |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://dev.mobidb.org/P03265 | https://mobidb.org/P03265#prediction-disorder-mobidb_lite125_166 | https://disprot.org/DP00003r003 | https://disprot.org/DP00003 |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://dev.mobidb.org/P03265 | https://mobidb.org/P03265#prediction-disorder-mobidb_lite1_108 | https://disprot.org/DP00003r002 | https://disprot.org/DP00003 |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://dev.mobidb.org/P03265 | https://mobidb.org/P03265#prediction-disorder-mobidb_lite1_108 | https://disprot.org/DP00003r004 | https://disprot.org/DP00003 |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://dev.mobidb.org/P03265 | https://mobidb.org/P03265#prediction-disorder-mobidb_lite1_108 | https://disprot.org/DP00003r003 | https://disprot.org/DP00003 |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://disprot.org/DP00003 | https://disprot.org/DP00003r003 | https://mobidb.org/P03265#prediction-disorder-mobidb_lite125_166 | https://dev.mobidb.org/P03265 |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://disprot.org/DP00003 | https://disprot.org/DP00003r003 | https://mobidb.org/P03265#prediction-disorder-mobidb_lite1_108 | https://dev.mobidb.org/P03265 |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://disprot.org/DP00003 | https://disprot.org/DP00003r002 | https://mobidb.org/P03265#prediction-disorder-mobidb_lite125_166 | https://dev.mobidb.org/P03265 |


For each protein, the following query returns its name (if known), the number of distinct sequence annotations, and the number of distinct sources from which those annotations were extracted. Results are only returned for proteins with annotations from more than one source.

In [21]:
displayResults(idpKG.query("""
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>
SELECT ?protein (SAMPLE(?proteinName) AS ?name) (COUNT(DISTINCT ?annotation) AS ?annotationCount) (COUNT(DISTINCT ?source) AS ?sourceCount)
WHERE {
    {
        SELECT DISTINCT ?protein ?proteinName
        WHERE {
            GRAPH ?g {
                ?protein a schema:Protein .
                OPTIONAL {?protein schema:name ?proteinName .}
            }
        }
    }
    {
        SELECT ?annotation ?source ?protein
        WHERE {
            GRAPH ?g {
                ?protein schema:hasSequenceAnnotation ?annotation
            }
            ?g pav:retrievedFrom ?source .
        }
    }
} 
GROUP BY ?protein
HAVING (COUNT(DISTINCT ?source) > 1)
ORDER BY DESC(?annotationCount)
"""))

| protein | name | annotationCount | sourceCount |
| --- | --- | --- | --- |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | 5 | 2 |


The following variant of the query lists each annotation together with the source from which it was extracted.

In [22]:
displayResults(idpKG.query("""
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>
SELECT ?protein ?proteinName ?annotation ?source
WHERE {
    {
        SELECT DISTINCT ?protein ?proteinName
        WHERE {
            GRAPH ?g {
                ?protein a schema:Protein .
                OPTIONAL {?protein schema:name ?proteinName .}
            }
        }
    }
    {
        SELECT ?annotation ?source ?protein
        WHERE {
            GRAPH ?g {
                ?protein schema:hasSequenceAnnotation ?annotation
            }
            ?g pav:retrievedFrom ?source .
        }
    }
} 
ORDER BY ?protein ?annotation
"""))

| protein | proteinName | annotation | source |
| --- | --- | --- | --- |
| https://bioschemas.org/entity/P03255 | | https://proteinensemble.org/PED00174#SRA_P06400 | https://proteinensemble.org/PED00174 |
| https://bioschemas.org/entity/P03255 | | https://proteinensemble.org/PED00174#SRB_P03255 | https://proteinensemble.org/PED00174 |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://disprot.org/DP00003r002 | https://disprot.org/DP00003 |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://disprot.org/DP00003r003 | https://disprot.org/DP00003 |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://disprot.org/DP00003r004 | https://disprot.org/DP00003 |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://mobidb.org/P03265#prediction-disorder-mobidb_lite125_166 | https://dev.mobidb.org/P03265 |
| https://bioschemas.org/entity/P03265 | DNA-binding protein | https://mobidb.org/P03265#prediction-disorder-mobidb_lite1_108 | https://dev.mobidb.org/P03265 |
| https://bioschemas.org/entity/P06400 | | https://proteinensemble.org/PED00174#SRA_P06400 | https://proteinensemble.org/PED00174 |
| https://bioschemas.org/entity/P06400 | | https://proteinensemble.org/PED00174#SRB_P03255 | https://proteinensemble.org/PED00174 |
