# Converting Scraped Data into IDPCentral

Alasdair J G Gray ([ORCID:0000-0002-5711-4872](http://orcid.org/0000-0002-5711-4872)), _Heriot-Watt University, Edinburgh, UK_

Petros Papadopoulos ([ORCID:0000-0002-8110-7576](https://orcid.org/0000-0002-8110-7576)), _Heriot-Watt University, Edinburgh, UK_

Ivan Mičetić ([ORCID:0000-0003-1691-8425](https://orcid.org/0000-0003-1691-8425)), _University of Padua, Italy_

Andras Hatos ([ORCID:0000-0001-9224-9820](https://orcid.org/0000-0001-9224-9820)), _University of Padua, Italy_

## Introduction

IDPCentral is a proposed central registry of proteins that are known to be intrinsically disordered.

We aim to populate the registry from Bioschemas markup that has been scraped using the BMUSE tool.

This notebook walks through the steps of converting the scraped content into the IDPCentral data model.

### IDPcentral Minimal Model Generation

Data is generated in JSON-LD format to support the IDPcentral UI. This only requires a minimal amount of data to be extracted from the crawls. The generated JSON-LD conforms to the following model.
```json
{
  "idp:name" :
  "idp:identifier" :
  "idp:sameAs" :
  "idp:sequence_range" : [
    {
      "idp:sequence_id" :
      "idp:start" :
      "idp:end" :
      "idp:range_annotation" :
    }
  ]
  "idp:resource_name" :
  "idp:last_update" :
}
```
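
As an illustration of the shape above, the following is a minimal sketch of one record built as a Python dictionary and serialised with the standard `json` module. Only the keys come from the model; every value is a hypothetical placeholder and not the result of a crawl.

```python
import json

# Hypothetical placeholder values; only the keys are taken from the model above.
record = {
    "idp:name": "Example protein",
    "idp:identifier": "EX0001",
    "idp:sameAs": "https://www.uniprot.org/uniprot/P00000",
    "idp:sequence_range": [
        {
            "idp:sequence_id": "https://example.org/EX0001#r1",
            "idp:start": 1,
            "idp:end": 42,
            "idp:range_annotation": "Disorder",
        }
    ],
    "idp:resource_name": "https://example.org/EX0001",
    "idp:last_update": "2020-12-01",
}

# Serialise the record; indent=2 only affects readability.
serialised = json.dumps(record, indent=2)
print(serialised)
```

A protein with several annotated regions would simply carry more objects in the `idp:sequence_range` list.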

### Knowledge Graph Generation

This notebook contains two approaches for extracting data from the crawls.

1. Intended to follow the Wikidata approach, but currently limited to a statement on each protein recording where its data came from;
  - The following types of data are extracted from their sources:
    - `Protein`: using the Bioschemas Protein Profile ([0.11 RELEASE](https://bioschemas.org/profiles/Protein/0.11-RELEASE/)) as the data model for the information extracted
      - Protein identifiers reconciled to create a new Bioschemas Protein using the UniProt accession number as the local ID. This harmonises the proteins across the data sources
    - `SequenceAnnotation`: using the properties declared in the Bioschemas SequenceAnnotation ([0.1 DRAFT](https://bioschemas.org/types/SequenceAnnotation/0.1-DRAFT-2019_06_21/)) type
      - Source IRIs carried through
    - `PropertyValue`: using properties found in the data sources as there is no profile to guide us here
      - Source IRIs carried through
    - `SequenceRange`: using the properties declared in the Bioschemas SequenceRange ([0.1 DRAFT](https://bioschemas.org/types/SequenceRange/0.1-DRAFT-2019_06_21)) type
      - Source IRIs carried through
  - Extract metadata as a separate query and include as additional intermediary nodes
    - Encountering problems creating the intermediary nodes; it may be better to put the KG in a named graph and hang the metadata off the graph, although this would lose some of the advantages of mixing the data together in one graph
    - Hacked a metadata approach that adds the source of the protein data into the Protein extraction query

1. Using named graphs for each data crawl with provenance data included in the default graph (idea inspired by Open PHACTS approach to metadata tracking in the data cache).
  - Protein information extracted from all three data sources in a single query and stored in corresponding named graph for the page that has been crawled
  - Minimal provenance information for each named graph (`retrievedFrom` and `retrievedOn`) added to default graph

### To Dos/Issues

- ~~Return provenance data to IDPcentral: straightforward add properties from graph~~
- ~~Consider changing base namespace for IDPKG graph to either one you own or Wikidata~~
  - ~~Is it valid for us to hang our properties off a UniProt ID~~
  - ~~If we are using UniProt IDs we need to be consistent in our usage~~
  - ~~Using UniProt accession with Bioschemas namespace~~
- Add metadata properties and statements for the Wikidata approach
  - AG struggling to generate the required UUIDs using rdflib (see description above); it is probably possible, but it is not clear that the benefit would be worth the effort
- [ ] Add query to retrieve the proteins that are retrieved from multiple sources over OPS data approach
- [ ] Add queries to extract other types of data for the OPS approach
- Retrieve UniProt label for IDPcentral using SPARQLWrapper to make external call and add properties
- Do we want to connect to Wikidata IDs?
- ~~IDPcentral is not getting updated with entries from mobidb~~
- MobiDB data getting mangled by BMUSE (currently testing with manually fixed file)
  - BMUSE `#` bug being removed; ready to be used for a fresh scrape
  - 2020-12 Data sources going to be redeployed without DataRecord. See more details in [Bioschemas#475](https://github.com/BioSchemas/specifications/issues/475). New deployment should be ready early 2021.
- Investigate whether [rdf-config](https://github.com/dbcls/rdf-config/blob/master/doc/spec.md) can be used to document the generated KG model
- IDPcentral json is not getting any data from PED
  - PED does not provide the following properties for the proteins (some of these are available for the structure group but not the individual protein). The relevant query patterns have been made optional.
    - name
    - additionalProperty
  - PED proteins now being included in the IDPcentral json-ld. At the moment the sequence ranges do not have a name associated with them. Checking with Andras and Ivan as to whether the `name` value in the `creationMethod` property should be used for this purpose.
  - PED markup being updated. __Wait for fixed markup before progressing this any further.__
  

## IDPKG Data Model

The IDPKG data model reuses ideas from [Wikidata](https://www.mediawiki.org/wiki/Wikibase/DataModel) whereby every statement loaded contains a provenance link as to where it was acquired.

- [ ] Document IDPCentral Model

### Identifiers

For the identifiers in the KG we use the UniProt accession within the Bioschemas namespace. This produces unique IRIs that are distinct from UniProt's own. It adds a level of indirection to the integration, since it relies on `sameAs` links, but it provides the flexibility to choose whether to combine the data.
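
A minimal sketch of this identifier scheme, assuming the `https://bioschemas.org/entity/` namespace that the queries in this notebook declare as the `bs:` prefix. The helper name `bioschemas_iri` is hypothetical and the accession is only an example input.

```python
# Namespace declared as the `bs:` prefix in the queries of this notebook.
BS_ENTITY = "https://bioschemas.org/entity/"


def bioschemas_iri(uniprot_iri: str) -> str:
    """Hypothetical helper: map a UniProt IRI to the KG's own IRI."""
    # The accession is the last path segment of the UniProt IRI.
    accession = uniprot_iri.rstrip("/").rsplit("/", 1)[-1]
    return BS_ENTITY + accession


print(bioschemas_iri("https://www.uniprot.org/uniprot/P03265"))
```

Both the `https://www.uniprot.org/` and `http://purl.uniprot.org/` forms of a UniProt IRI map to the same KG IRI, which is what harmonises proteins across the data sources.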

## Data Sources

The following databases have been scraped to populate IDPCentral:
- [DisProt](https://www.disprot.org/)
- [MobiDB](https://mobidb.bio.unipd.it/)
- [Protein Ensemble Database](https://proteinensemble.org/)

## Conversion using RDFlib

This is an attempt to achieve the same functionality without using a triplestore.

### Import and configure the logging library

In [None]:
from datetime import datetime
import logging
logging.basicConfig(
    filename='idpETL.log', 
    filemode='w', 
    format='%(levelname)s:%(message)s', 
    level=logging.INFO)
logging.info('Starting processing at %s' % datetime.now().time())

Load in the RDFLib library.

In [None]:
from rdflib import ConjunctiveGraph, Dataset, Graph, RDF, URIRef

Template library used to template queries.

In [None]:
from string import Template
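
A short sketch of how `Template` fills a placeholder in a query string. The one-line `ASK` query is a hypothetical stand-in; only the placeholder name `bsAccession` mirrors the one used in this notebook.

```python
from string import Template

# Hypothetical one-line query; `${bsAccession}` mirrors the notebook's placeholder.
example = Template("ASK { <https://bioschemas.org/entity/${bsAccession}> ?p ?o }")

# substitute() raises KeyError if a placeholder is left unfilled.
query = example.substitute(bsAccession="P03265")
print(query)

# safe_substitute() leaves unknown placeholders untouched instead of raising.
partial = example.safe_substitute()
print(partial)
```

Only `$` is special to `Template`, so the braces and `?variables` of SPARQL pass through unchanged.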

Import functions to list files in directory

In [None]:
from glob import glob

Import the ability to create a UUID

In [None]:
import uuid
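
A minimal sketch of how a UUID could be used to mint a unique reference-node IRI, following the pattern in the commented-out attempt inside `createWDKGEntity`. The helper name `reference_node_iri` is hypothetical, and `uuid4` (random) is used here as an assumption where that attempt uses `uuid1`.

```python
import uuid


def reference_node_iri(accession: str) -> str:
    """Hypothetical helper: mint a fresh reference-node IRI for an accession."""
    # uuid4 is random, so every call yields a different IRI.
    return "https://bioschemas.org/reference/" + accession + "-" + str(uuid.uuid4())


first = reference_node_iri("P03265")
second = reference_node_iri("P03265")
print(first)
```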

### Wikidata Approach

Queries and methods for the Wikidata approach of organising the resulting KG.

Templated query for creating the direct properties for a protein entity, i.e. the data.

In [None]:
proteinQueryWD = Template("""
# Query to convert DisProt scraped data to IDPCentral model
# Defensive query: assumes that data does not conform to Protein profile

PREFIX bs: <https://bioschemas.org/entity/>
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>

CONSTRUCT {
    bs:${bsAccession} a schema:Protein ;
        schema:identifier ?identifier ;
        schema:name ?name ;
        schema:associatedDisease ?associatedDisease ;
        schema:description ?description ;
        schema:hasSequenceAnnotation ?annotation ;
        schema:isEncodedByBioChemEntity ?encodedBy ;
        schema:taxonomicRange ?taxonomicRange ;
        schema:url ?url ;
        schema:alternateName ?alternateName ;
        schema:bioChemInteraction ?bioChemInteraction ;
        schema:bioChemSimilarity ?bioChemSimilarity ;
        schema:hasBioChemEntityPart ?bioChemEntity ;
        schema:hasBioPloymerSequence ?sequence ;
        schema:hasMolecularFunction ?molFunction ;
        schema:hasRepresentation ?representation ;
        schema:image ?image ;
        schema:isInvolvedInBiologicalProcess ?process ;
        schema:isLocatedInSubcellularLocation ?cellularLocation ;
        schema:isPartOfBioChemEntity ?parentEntity ;
        schema:sameAs ?sameAs , ?s ;
        pav:retrievedFrom ?source.
}
WHERE {
    GRAPH ?g {
# Bioschemas Minimal Properties
        ?s a schema:Protein .
        OPTIONAL {?s schema:identifier ?identifier }
        OPTIONAL {?s schema:name ?name }
## Bioschemas Recommended properties
        OPTIONAL {?s schema:associatedDisease ?associatedDisease}
        OPTIONAL {?s schema:description ?description}
        OPTIONAL {?s schema:hasSequenceAnnotation ?annotation }
        OPTIONAL {?s schema:isEncodedByBioChemEntity ?encodedBy}
        OPTIONAL {?s schema:taxonomicRange ?taxonomicRange }
        OPTIONAL {?s schema:url ?url}
## Bioschemas Optional properties
        OPTIONAL {?s schema:alternateName ?alternateName}
        OPTIONAL {?s schema:bioChemInteraction ?bioChemInteraction}
        OPTIONAL {?s schema:bioChemSimilarity ?bioChemSimilarity}
        OPTIONAL {?s schema:hasBioChemEntityPart ?bioChemEntity}
        OPTIONAL {?s schema:hasBioPloymerSequence ?sequence}
        OPTIONAL {?s schema:hasMolecularFunction ?molFunction}
        OPTIONAL {?s schema:hasRepresentation ?representation }
        OPTIONAL {?s schema:image ?image}
        OPTIONAL {?s schema:isInvolvedInBiologicalProcess ?process}
        OPTIONAL {?s schema:isLocatedInSubcellularLocation ?cellularLocation}
        OPTIONAL {?s schema:isPartOfBioChemEntity ?parentEntity}
        OPTIONAL {?s schema:sameAs ?sameAs }
    }
    ?g pav:retrievedFrom ?source ;
            pav:retrievedOn ?date .
}
""")

Query to retrieve data about sequence annotations

In [None]:
sequenceAnnotationsQueryWD = """
PREFIX schema: <https://schema.org/>
CONSTRUCT {
  ?s a schema:SequenceAnnotation ;
        schema:additionalProperty ?addProp ;
        schema:citation ?citation ;
        schema:creationMethod ?method ;
        schema:description ?description ;
        schema:editor ?editor ;
        schema:isPartOfBioChemEntity ?bioChemEntity ;
        schema:sequenceLocation ?seqLoc .
}
WHERE {
  graph ?g {
    ?s a schema:SequenceAnnotation .
    OPTIONAL {?s schema:additionalProperty ?addProp }
    OPTIONAL {?s schema:citation ?citation }
    OPTIONAL {?s schema:creationMethod ?method }
    OPTIONAL {?s schema:description ?description }
    OPTIONAL {?s schema:editor ?editor }
    OPTIONAL {?s schema:isPartOfBioChemEntity ?bioChemEntity }
    OPTIONAL {?s schema:sequenceLocation ?seqLoc }
  }
}
"""

Retrieve triples about PropertyValues.

In [None]:
propertyValueQueryWD = """
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    ?s a schema:PropertyValue ;
        schema:name ?name ;
        schema:value ?value .
}
where {
    graph ?g {
        ?s a schema:PropertyValue .
        OPTIONAL {?s schema:name ?name }
        OPTIONAL {?s schema:value ?value }
    }
}
"""

Retrieve triples about SequenceRange.

In [None]:
sequenceRangeQueryWD = """
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    ?s a schema:SequenceRange ;
        schema:rangeStart ?start ;
        schema:rangeEnd ?end .
}
where {
    graph ?g {
        ?s a schema:SequenceRange .
        OPTIONAL {?s schema:rangeStart ?start }
        OPTIONAL {?s schema:rangeEnd ?end}
    }
}
"""

Templated query for creating the links to the provenance for each statement in the KG.

In [None]:
## First attempt
# createEntityProvenanceQuery = Template("""
# PREFIX bs: <https://bioschemas.org/entity/>
# PREFIX bsp: <https://bioschemas.org/ns/p/>
# PREFIX bss: <https://bioschemas.org/ns/s/>
# PREFIX bsr: <https://bioschemas.org/reference/>
# PREFIX pav: <http://purl.org/pav/>
# PREFIX prov: <http://www.w3.org/ns/prov#>
# PREFIX schema: <https://schema.org/>
# CONSTRUCT {
# #    bs:${bsAccession} a schema:Protein ;
# #        schema:identifier ?identifier ;
# #        schema:name ?name ;
# #        schema:hasSequenceAnnotation ?annotation ;
# #        schema:taxonomicRange ?taxonomicRange ;
# #        schema:hasRepresentation ?representation ;
# #        schema:sameAs ?sameAs , <${proteinIRI}>.
#     bs:${bsAccession} bsp:type [
#         prov:wasDerivedFrom <${refNodeIRI}> ;
#         bss:type schema:Protein
#     ] .
#     ?refNode pav:retrievedFrom ?source ;
#         pav:retrievedOn ?date .
# }
# WHERE {
#     GRAPH ?g {
# # Bioschemas Minimal Properties
#         <${proteinIRI}> a schema:Protein .#;
# #            schema:identifier ?identifier ;
# #            schema:name ?name ;
# # Bioschemas Recommended properties
# #            schema:hasSequenceAnnotation ?annotation ;
# #            schema:taxonomicRange ?taxonomicRange ;
# # Bioschemas Optional properties
# #            schema:sameAs ?sameAs .
# #        OPTIONAL {<${proteinIRI}> schema:hasRepresentation ?representation }
#         ?g pav:retrievedFrom ?source ;
#             pav:retrievedOn ?date .
#     }
# }
# """)

In [None]:
createEntityProvenanceQueryWD = Template("""
PREFIX bs: <https://bioschemas.org/entity/>
PREFIX bsp: <https://bioschemas.org/ns/p/>
PREFIX bss: <https://bioschemas.org/ns/s/>
PREFIX bsr: <https://bioschemas.org/reference/>
PREFIX pav: <http://purl.org/pav/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    ?g pav:retrievedFrom ?source ;
        pav:retrievedOn ?date .
}
WHERE {
    GRAPH ?g {
# Bioschemas Minimal Properties
        <${proteinIRI}> a schema:Protein .#;
    
        ?g pav:retrievedFrom ?source ;
            pav:retrievedOn ?date .
    }
}
""")

Method for creating the KG entity and its metadata using the templated queries above.

In [None]:
def createWDKGEntity(g, protein, uniprot, accession):
    kgEntity = Graph()
    query = proteinQueryWD.substitute(proteinIRI=protein,bsAccession=accession)
    logging.debug('Query: %s' % query)
    kgEntity += g.query(query)
    kgEntity += g.query(sequenceAnnotationsQueryWD)
    logging.debug('SequenceAnnotation Query: %s' % sequenceAnnotationsQueryWD)
    kgEntity += g.query(propertyValueQueryWD)
    logging.debug('PropertyValue Query: %s' % propertyValueQueryWD)
    kgEntity += g.query(sequenceRangeQueryWD)
    logging.debug('SequenceRange Query: %s' % sequenceRangeQueryWD)
#    # Attempt to generate provenance statements as per Wikidata
#     u = uuid.uuid1()
#     refNode = "https://bioschemas.org/reference/" + accession + "-" + str(u)
#     query = createEntityProvenanceQuery.substitute(proteinIRI=protein,bsAccession=accession,refNodeIRI=refNode)
#     print(query)
#     kgEntity += g.query(query)
    # Attempt to extract provenance data
    query = createEntityProvenanceQueryWD.substitute(proteinIRI=protein)
    logging.debug('Provenance Query: %s' % query)
    provResult = g.query(query)
    return kgEntity

### OPS Approach Queries and Methods

Queries and methods for storing the data in a graph per source file and metadata on the graph IRI.

Query to extract the graph and its metadata

In [None]:
provenanceQueryOPS = """
PREFIX pav: <http://purl.org/pav/>
PREFIX prov: <http://www.w3.org/ns/prov#>
CONSTRUCT {
    ?g pav:retrievedFrom ?source ;
        pav:retrievedOn ?date .
}
WHERE {
    ?g pav:retrievedFrom ?source ;
        pav:retrievedOn ?date .
}
"""

Templated query for creating the direct properties for a protein entity, i.e. the data.

In [None]:
proteinQueryOPS = Template("""
# Query to convert Protein scraped data to BSKG OPS model
# Defensive query: assumes that data does not conform to Protein profile

PREFIX bs: <https://bioschemas.org/entity/>
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>

CONSTRUCT {
    bs:${bsAccession} a schema:Protein ;
        schema:identifier ?identifier ;
        schema:name ?name ;
        schema:associatedDisease ?associatedDisease ;
        schema:description ?description ;
        schema:hasSequenceAnnotation ?annotation ;
        schema:isEncodedByBioChemEntity ?encodedBy ;
        schema:taxonomicRange ?taxonomicRange ;
        schema:url ?url ;
        schema:alternateName ?alternateName ;
        schema:bioChemInteraction ?bioChemInteraction ;
        schema:bioChemSimilarity ?bioChemSimilarity ;
        schema:hasBioChemEntityPart ?bioChemEntity ;
        schema:hasBioPloymerSequence ?sequence ;
        schema:hasMolecularFunction ?molFunction ;
        schema:hasRepresentation ?representation ;
        schema:image ?image ;
        schema:isInvolvedInBiologicalProcess ?process ;
        schema:isLocatedInSubcellularLocation ?cellularLocation ;
        schema:isPartOfBioChemEntity ?parentEntity ;
        schema:sameAs ?sameAs , ?s .
}
WHERE {
    GRAPH ?g {
# Bioschemas Minimal Properties
        ?s a schema:Protein .
        OPTIONAL {?s schema:identifier ?identifier }
        OPTIONAL {?s schema:name ?name }
## Bioschemas Recommended properties
        OPTIONAL {?s schema:associatedDisease ?associatedDisease}
        OPTIONAL {?s schema:description ?description}
        OPTIONAL {?s schema:hasSequenceAnnotation ?annotation }
        OPTIONAL {?s schema:isEncodedByBioChemEntity ?encodedBy}
        OPTIONAL {?s schema:taxonomicRange ?taxonomicRange }
        OPTIONAL {?s schema:url ?url}
## Bioschemas Optional properties
        OPTIONAL {?s schema:alternateName ?alternateName}
        OPTIONAL {?s schema:bioChemInteraction ?bioChemInteraction}
        OPTIONAL {?s schema:bioChemSimilarity ?bioChemSimilarity}
        OPTIONAL {?s schema:hasBioChemEntityPart ?bioChemEntity}
        OPTIONAL {?s schema:hasBioPloymerSequence ?sequence}
        OPTIONAL {?s schema:hasMolecularFunction ?molFunction}
        OPTIONAL {?s schema:hasRepresentation ?representation }
        OPTIONAL {?s schema:image ?image}
        OPTIONAL {?s schema:isInvolvedInBiologicalProcess ?process}
        OPTIONAL {?s schema:isLocatedInSubcellularLocation ?cellularLocation}
        OPTIONAL {?s schema:isPartOfBioChemEntity ?parentEntity}
        OPTIONAL {?s schema:sameAs ?sameAs }
    }
}
""")

Method for creating the KG entity and its metadata using the templated queries above.

In [None]:
def createOPSKGEntity(g, ds, protein, uniprot, accession):
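    # Retrieve provenance of crawl and add to default graph
    result = g.query(provenanceQueryOPS)
    # Graph IRI of the crawl; stays None if the file carries no provenance
    context = None
    # Insert provenance into default context
    for s, p, o in result:
        ds.add((s, p, o))
        # Store context of crawl
        context = s
    if context is None:
        # Without a graph IRI there is no named graph to store the entity in
        logging.warning('No provenance found; skipping %s' % protein)
        return
    logging.debug('Context %s' % (context))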
    # Parameterise the query with the proteinIRI and accession
    query = proteinQueryOPS.substitute(proteinIRI=protein,bsAccession=accession)
    logging.debug('Query: %s' % query)
    # Create context in Dataset for the crawled entity
    ds_g = ds.graph(URIRef(context))
    # Retrieve crawled entity
    result = g.query(query)
    logging.debug("\tconvert query has %s statements." % len(result))
    # Add crawled entity to Dataset
    ds_g += result
#     kgEntity += g.query(sequenceAnnotationsQueryWD)
#     logging.debug('SequenceAnnotation Query: %s' % sequenceAnnotationsQueryWD)
#     kgEntity += g.query(propertyValueQueryWD)
#     logging.debug('PropertyValue Query: %s' % propertyValueQueryWD)
#     kgEntity += g.query(sequenceRangeQueryWD)
#     logging.debug('SequenceRange Query: %s' % sequenceRangeQueryWD)

### Generic Queries and Methods

Queries and methods shared by both approaches

Query to extract source protein IRI and declared sameAs UniProt IRI

In [None]:
idQuery = """
PREFIX schema: <https://schema.org/>
SELECT ?proteinIRI ?uniprot
WHERE {
    GRAPH ?g {
        ?proteinIRI a schema:Protein ;
            schema:sameAs ?uniprot .
        FILTER regex(str(?uniprot), "^(https://www|http://purl).uniprot.org/uniprot/")
    }
}
"""
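
To see which `sameAs` targets the `FILTER` accepts, the same pattern can be tried with Python's `re` module. This is only a sketch: SPARQL `regex` follows XPath regular-expression rules, which for this simple pattern are assumed to behave the same way. The candidate IRIs are example inputs.

```python
import re

# Same pattern as in the FILTER; note the unescaped dots match any character.
pattern = re.compile(r"^(https://www|http://purl).uniprot.org/uniprot/")

candidates = [
    "https://www.uniprot.org/uniprot/P03265",
    "http://purl.uniprot.org/uniprot/P03265",
    "http://www.uniprot.org/uniprot/P03265",  # http + www is not in the alternation
    "https://disprot.org/DP00003",
]

# SPARQL regex() matches anywhere in the string, like re.search().
matches = [iri for iri in candidates if pattern.search(iri)]
print(matches)
```

Only the first two candidates pass, so a page that links to UniProt with an `http://www.uniprot.org/` IRI would not be picked up by `idQuery`.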

### IDP Central Queries
Function to extract just the triples that IDPCentral are using in their UI

In [None]:
idpQuery = """
PREFIX idp: <https://example.com/ipd/>
PREFIX pav: <http://purl.org/pav/>
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    ?entry_url idp:name ?entry_name ;
        idp:identifier ?entry_id ;
        idp:sameAs ?uniprot_acc ;
        idp:sequence_range [
            idp:sequence_id ?sequenceID ;
            idp:start ?start ;
            idp:end ?end ;
            idp:range_annotation ?range_annotation
        ] ;
        idp:resource_name ?source ;
        idp:last_update ?date.
}
WHERE {
    GRAPH ?g {
        ?entry_url a schema:Protein ;
            schema:identifier ?entry_id ;
            schema:hasSequenceAnnotation ?sequenceID ;
            schema:sameAs ?uniprot_acc .
        OPTIONAL { 
            ?entry_url schema:name ?entry_name 
        }
        FILTER regex(str(?uniprot_acc), "^(https://www|http://purl).uniprot.org/uniprot/")
        ?sequenceID schema:sequenceLocation ?sequenceLocation .
        OPTIONAL {
            ?sequenceID schema:additionalProperty/schema:value/schema:name ?range_annotation 
        }
        ?sequenceLocation schema:rangeStart ?start ;
            schema:rangeEnd ?end.
        ?g pav:retrievedFrom ?source ;
            pav:retrievedOn ?date .
    }
}
"""
def idpExtraction(g):
    logging.debug('Query: %s' % idpQuery)
    results = g.query(idpQuery)
    logging.debug('\tQuery has %s statements.' % len(results))
    return results

### Data Extraction Control

Methods to process over the data files and control the ETL process.

Function to run over all files in a specified directory

In [None]:
def processDataFiles(idpWDKG, idpOPSKG, idpModel, directoryLocation):
    processed = 0
    for file in glob(directoryLocation + "*.nq"):
        logging.info("\tProcessing file: %s" % file)
        g = ConjunctiveGraph()
        g.parse(file, format="nquads")
        logging.info("\tSource has %s statements." % len(g))
        # Extract statements for IDPCentral
        idpModel += idpExtraction(g)
        logging.info("\tIDPcentral has %s statements." % len(idpModel))
        # Extract source protein and UniProt IRIs
        results = g.query(idQuery)
        logging.info("\tID query result has %s statements." % len(results))
        # Convert to IDPCentral model
        for result in results:
            proteinIRI = result['proteinIRI']
            uniprotIRI = result['uniprot']
            logging.debug("\tProtein: %s\n\tUniProt: %s" % (proteinIRI, uniprotIRI))
            
            # Extract UniProt accession to use as an identifier in the Bioschemas namespace
            uniprotAccession = uniprotIRI[uniprotIRI.rindex('/')+1:]
            logging.info('Accession: %s' % uniprotAccession)

            # Create entity for Wikidata approach
            resGraph = createWDKGEntity(g, proteinIRI, uniprotIRI, uniprotAccession)
            logging.debug("\tconvert query has %s statements." % len(resGraph))
            idpWDKG += resGraph
            logging.info("\tIDPKG-WD has %s statements." % len(idpWDKG))
            
            # Create entity for OPS approach
            createOPSKGEntity(g, idpOPSKG, proteinIRI, uniprotIRI, uniprotAccession)
            logging.info("\tIDPKG-OPS has %s statements." % len(idpOPSKG))
        processed += 1
    return processed
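
`processDataFiles` builds its file pattern by string concatenation, so `directoryLocation` must end with a slash. A minimal sketch of an alternative, assuming a hypothetical helper `list_nquads`, joins the path with `os.path.join` and sorts the result so that log output is reproducible (`glob` makes no ordering guarantee).

```python
import os
import tempfile
from glob import glob


def list_nquads(directory: str) -> list:
    """Hypothetical helper: sorted .nq paths, with or without a trailing slash."""
    return sorted(glob(os.path.join(directory, "*.nq")))


# Demonstrate on a temporary directory with two .nq files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2243.nq", "2241.nq", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in list_nquads(tmp)]

print(found)
```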

Function to output data files for a graph

In [None]:
def outputFiles(graph, label):
    # print(graph.serialize(format='nt'))
    logging.info("%s has %s statements." % (label, len(graph)))
    graph.serialize(label+'.nt', format='nt')
    graph.serialize(label+'.jsonld', format='json-ld')
    logging.info('Successfully written all triples to %s.nt' % label)

Read in each nq data file in turn

Process each file and convert into IDPCentral model

In [None]:
idpWDKG = Graph()
idpOPSKG = Dataset()
idpModel = Graph()
totalProcessed = 0

print("Processing DisProt...", end='')
numberOfFiles = processDataFiles(idpWDKG, idpOPSKG, idpModel, "../scraped-data/disprot/")
print("%d files processed" % numberOfFiles)
totalProcessed += numberOfFiles

print("Processing MobiDB...", end='')
numberOfFiles = processDataFiles(idpWDKG, idpOPSKG, idpModel, "../scraped-data/mobidb/")
print("%d files processed" % numberOfFiles)
totalProcessed += numberOfFiles

print("Processing PED...", end='')
numberOfFiles = processDataFiles(idpWDKG, idpOPSKG, idpModel, "../scraped-data/ped/")
print("%d files processed" % numberOfFiles)
totalProcessed += numberOfFiles

outputFiles(idpModel, "IDPCentral")
outputFiles(idpWDKG, "IDPKG-WD")
# outputFiles(idpOPSKG, "IDPKG-OPS")
logging.info('Processed %d files' % totalProcessed)

assert (totalProcessed == 8), "Expected 8 data files to be present!"
assert (len(idpWDKG) == 222), "Expected 222 statements in IDP KG!"
assert (len(idpModel) == 120), "Expected 120 statements in IDPcentral Model!"
print('\nIDPcentral has %d statements.' % len(idpModel))
print('IDP WD KG has %d statements.' % len(idpWDKG))
print('IDP OPS KG has %d statements.' % len(idpOPSKG))
print('IDP OPS KG has %d contexts.' % sum(1 for _ in idpOPSKG.contexts()))
print('IDP OPS contexts:', '')
for c in idpOPSKG.contexts():
    print('\t%s' % c)
    print('\tNumber of statements %d' % len(c))
idpOPSKG.serialize('IDPKG-OPS.nq', format='nquads')
idpOPSKG.serialize('IDPKG-OPS.jsonld', format='json-ld')
print('\nIDP ETL process finished successfully!')

Expecting:
- 8 files to have been processed
  - 3 from DisProt
  - 2 from MobiDB
  - 3 from PED
- IDPKG should have 222 statements
  - DisProt
    - 53 after 2241.nq
    - 91 after 2243.nq
    - no additional statements from 2244.nq
  - MobiDB
    - 116 after 4283.nq
    - 134 after 5729.nq
  - PED
    - 152 statements after 6000.nq
    - 198 statements after 6001.nq
    - 222 statements after 5999.nq
- IDPCentral should have 120 statements
  - DisProt
    - 24 after 2241.nq
    - 41 after 2243.nq
    - no additional statements from 2244.nq
  - MobiDB
    - 58 after 4283.nq
    - 68 after 5729.nq
  - PED
    - 76 statements after 6000.nq
    - 112 statements after 6001.nq
    - 120 statements after 5999.nq

### Incorporating PED into the Pipeline

File 6000.nq corresponds to entry [PED00148](https://proteinensemble.org/PED00148). There is valid markup on the page. Something in BMUSE is preventing this from coming through to the scraped data.

File 5999.nq corresponds to entry [PED00001](http://proteinensemble.org/PED00001). The markup shows one protein component.

File 6001.nq corresponds to entry [PED00174](http://proteinensemble.org/PED00174). The markup shows 3 protein components.

- [X] Follow up with Petros as to whether BMUSE fixes mean that a new scrape of this content would now work.
- [ ] Rescraped data uses same IRI for DataRecord and BioChemEntity; BioChemEntity needs `#DR` removed from the `@id` value
- [ ] `createdWith` needs to be an IRI, it can have values from it
- [ ] Use of `#DR` has undesirable effect for BMUSE's use of `retrievedFrom`

## Querying the IDP Knowledge Graphs

In the previous section we have created two knowledge graph representations of the IDPcentral data. In this section we will compare the two approaches by querying the knowledge graphs.

### Support Function

Function to print SPARQL results.

In [None]:
def displayResults(results):
    for row in results.bindings:
        for col in row:
            print(col, row[col], end = '\t')
        print()

In [None]:
# print("json", result.serialize(format="json"))
# for row in result:
#     print(row)
# print(result.serialize(format="json"))

### Knowledge Graph Statistics

#### Number of Triples

##### Wikidata Approach

In [None]:
triples = idpWDKG.query("SELECT (COUNT(*) AS ?triples) { ?s ?p ?o  }")
displayResults(triples)

##### Named Graph Approach

In [None]:
triples = idpOPSKG.query("SELECT (COUNT(*) AS ?triples) { { ?s ?p ?o } UNION { GRAPH ?g {?s ?p ?o  }}}")   
displayResults(triples)

#### Number of Typed Entities

##### Wikidata Approach

In [None]:
entities = idpWDKG.query("SELECT (COUNT(DISTINCT ?s) AS ?entities) { ?s a [] }")
displayResults(entities)

##### Named Graph Approach

In [None]:
entities = idpOPSKG.query("SELECT (COUNT(DISTINCT ?s) AS ?entities) { { ?s a [] } UNION { GRAPH ?g { ?s a [] }}}")
displayResults(entities)

#### Instances per Class

##### Wikidata Approach

In [None]:
classCountQuery = """
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?Class (COUNT(DISTINCT ?s) AS ?distinctInstances) 
{ ?s a ?Class } 
GROUP BY ?Class
"""
classCounts = idpWDKG.query(classCountQuery)
displayResults(classCounts)

##### Named Graph Approach

In [None]:
classCountQuery = """
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?Class (COUNT(DISTINCT ?s) AS ?distinctInstances) 
{
    GRAPH ?g {
        ?s a ?Class
    }
} 
GROUP BY ?Class
"""
classCounts = idpOPSKG.query(classCountQuery)
displayResults(classCounts)

### Find proteins in multiple datasets

#### Wikidata Approach
Provenance information about the sources of triples has been included as a direct assertion on the protein resource. There will be multiple declarations if the triples about the protein have come from multiple pages. We cannot distinguish which source any individual statement has come from.

To find proteins with multiple sources, we group by the protein IRI and then use a `HAVING` clause to keep only those with more than one source. The sources can be listed using the `GROUP_CONCAT` operator.

In [None]:
proteinQuery = """
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?protein (COUNT(?source) as ?numSources) (GROUP_CONCAT(?source;SEPARATOR=",") AS ?sources)
WHERE {
    ?protein a schema:Protein ;
        pav:retrievedFrom ?source .
}
GROUP BY ?protein
HAVING (COUNT(*) > 1)
"""
proteins = idpWDKG.query(proteinQuery)
displayResults(proteins)
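
The grouping performed by the query above can be mimicked in plain Python, which may help when checking its output by hand. This is a sketch over hypothetical `(protein, source)` pairs standing in for the `pav:retrievedFrom` statements; none of the values come from the crawls.

```python
from collections import defaultdict

# Hypothetical (protein, source) pairs standing in for pav:retrievedFrom statements.
statements = [
    ("bs:P00001", "https://example.org/a/1"),
    ("bs:P00001", "https://example.org/b/7"),
    ("bs:P00002", "https://example.org/a/2"),
]

# GROUP BY ?protein
sources = defaultdict(list)
for protein, source in statements:
    sources[protein].append(source)

# HAVING (COUNT(*) > 1), with GROUP_CONCAT emulated by ",".join
multi = {p: ",".join(s) for p, s in sources.items() if len(s) > 1}
print(multi)
```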

#### Named Graph Approach

Provenance information is stored in the default graph as annotations on each named graph.

A protein comes from multiple sources if its type statement is found in multiple named graphs. The number of named graphs containing that statement gives the number of sources.

In [None]:
proteinQuery = """
PREFIX schema: <https://schema.org/>
PREFIX pav: <http://purl.org/pav/>
SELECT ?protein (COUNT(?g) as ?numSources) (GROUP_CONCAT(?source;SEPARATOR=",") AS ?sources)
WHERE {
    GRAPH ?g {
        ?protein a schema:Protein .
    }
    ?g pav:retrievedFrom ?source .
}
GROUP BY ?protein
HAVING (COUNT(*) > 1)
"""
proteins = idpOPSKG.query(proteinQuery)
displayResults(proteins)
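
The join between named graphs and the provenance held in the default graph can be sketched in plain Python. Both dictionaries below are hypothetical stand-ins: one maps a graph IRI to the proteins typed inside it, the other maps a graph IRI to its `pav:retrievedFrom` value.

```python
from collections import defaultdict

# Hypothetical named graphs: graph IRI -> proteins typed inside that graph.
named_graphs = {
    "urn:g1": {"bs:P00001", "bs:P00002"},
    "urn:g2": {"bs:P00001"},
}
# Hypothetical default graph: graph IRI -> page the graph was retrieved from.
retrieved_from = {
    "urn:g1": "https://example.org/a/1",
    "urn:g2": "https://example.org/b/7",
}

# Join each protein to the source of every graph that contains it.
by_protein = defaultdict(list)
for graph, proteins_in_graph in sorted(named_graphs.items()):
    for protein in proteins_in_graph:
        by_protein[protein].append(retrieved_from[graph])

# Keep proteins that occur in more than one named graph.
multi_graph = {p: s for p, s in by_protein.items() if len(s) > 1}
print(multi_graph)
```

Unlike the Wikidata-style graph, each source here is tied to a specific named graph, so the same join could be extended to individual statements rather than whole proteins.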

### Find proteins with annotations in multiple datasets
Again we exploit the multiplicity of identifiers to check for multiple datasets. However, we now explicitly check that there are two; again we could add more.

Note that we have changed to a `CONSTRUCT` query, since a tuple-bindings approach returns duplicate rows: the identifiers can be bound first one way round and then the other.

#### Problem
The problem with this query is that it only checks that the protein appears in both datasets; it does not check that the annotations come from different datasets.

#### Possible Solution
For each protein and annotation, add a statement stating the sources that it has come from in the data, or alternatively have a named graph per source.

In [None]:
proteinAnnotationQuery = """
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    ?protein a schema:Protein ;
        schema:name ?proteinName ;
        schema:identifier ?id1, ?id2 ;
        schema:hasSequenceAnnotation [
            schema:description ?annotationDescription 
        ].
}
WHERE {
    ?protein a schema:Protein ;
        schema:name ?proteinName ;
        schema:identifier ?id1, ?id2 ;
        schema:hasSequenceAnnotation ?annotation .
    OPTIONAL {?annotation schema:description ?annotationDescription }
    FILTER (?id1 != ?id2) .
}
    
"""
print(idpWDKG.query(proteinAnnotationQuery).serialize(format='n3'))

### Find proteins with annotations in only one source

### Find proteins with annotations of type X

### Find annotations with identical ranges

### Find annotations with overlapping ranges