# Converting Scraped Data into IDPCentral

Alasdair J G Gray ([ORCID:0000-0002-5711-4872](http://orcid.org/0000-0002-5711-4872))

_Heriot-Watt University, Edinburgh, UK_

## Introduction

IDPCentral is the idea of having a central registry of proteins that are known to be disordered.

We aim to populate the content of the registry with Bioschemas markup that has been scraped using the BMUSE tool.

This notebook goes through the steps of converting the scraped content into the IDPCentral data model.

## IDPCentral Data Model

The IDPCentral data model reuses ideas from [Wikidata](https://www.mediawiki.org/wiki/Wikibase/DataModel): every loaded statement carries a provenance link to the source from which it was acquired.

- [ ] Document IDPCentral Model

## Data Sources

The following databases have been scraped to populate IDPCentral:
- [DisProt](https://www.disprot.org/)
- [MobiDB](https://mobidb.bio.unipd.it/)
- [Protein Ensemble Database](https://proteinensemble.org/)

## Conversion using RDFlib

This section attempts to achieve the same conversion in memory with RDFLib, without using a triplestore.

Load in the RDFLib library.

In [None]:
from rdflib import ConjunctiveGraph, Graph

Import `Template` from the standard library, which is used to parameterise the queries.

In [None]:
from string import Template

Import `glob` to list the files in a directory.

In [None]:
from glob import glob

Prepare query to extract UniProt and DisProt IRIs.

In [None]:
idQuery = """
PREFIX schema: <https://schema.org/>
SELECT ?proteinIRI ?uniprot
WHERE {
    GRAPH ?g {
        ?proteinIRI a schema:Protein ;
            schema:sameAs ?uniprot .
        FILTER regex(str(?uniprot), "^(https://www|http://purl).uniprot.org/uniprot/")
    }
}
"""

Prepare the templated CONSTRUCT query that performs the conversion.

In [None]:
convertQuery = Template("""
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    <${uniprotIRI}> a schema:Protein ;
        schema:identifier ?identifier ;
        schema:name ?name ;
        schema:hasSequenceAnnotation ?annotation ;
        schema:taxonomicRange ?taxonomicRange ;
        schema:hasRepresentation ?representation ;
        schema:sameAs ?sameAs , <${proteinIRI}>.
}
WHERE {
    GRAPH ?g {
# Bioschemas Minimal Properties
        <${proteinIRI}> a schema:Protein ;
            schema:identifier ?identifier ;
            schema:name ?name ;
# Bioschemas Recommended properties
            schema:hasSequenceAnnotation ?annotation ;
            schema:taxonomicRange ?taxonomicRange ;
# Bioschemas Optional properties
            schema:sameAs ?sameAs .
        OPTIONAL {<${proteinIRI}> schema:hasRepresentation ?representation }
    }
}
""")

Function that fills in the template and runs the CONSTRUCT query.

In [None]:
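def convertDisprot(g, protein, uniprot):
    """Run the CONSTRUCT query for one protein, attaching its statements to the UniProt IRI."""
    query = convertQuery.substitute(proteinIRI=protein, uniprotIRI=uniprot)
    # print(query)  # uncomment to inspect the generated query
    return g.query(query)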
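
As a minimal sketch of what the substitution step does, the cell below fills a shortened, hypothetical stand-in for `convertQuery` (the `demo` template and both IRIs are illustrative assumptions, not values from the scraped data). It also shows that `substitute()` fails loudly if a placeholder is left unfilled.

```python
from string import Template

# Hypothetical, shortened stand-in for the convertQuery template.
demo = Template("""
CONSTRUCT { <${uniprotIRI}> schema:sameAs <${proteinIRI}> }
WHERE { <${proteinIRI}> a schema:Protein }
""")

# Illustrative IRIs only.
filled = demo.substitute(
    proteinIRI="https://example.org/protein/1",
    uniprotIRI="https://www.uniprot.org/uniprot/P04637",
)
print(filled)

# substitute() raises KeyError when a placeholder has no value.
try:
    demo.substitute(proteinIRI="https://example.org/protein/1")
    missing = None
except KeyError as err:
    missing = err.args[0]
print("Missing placeholder:", missing)
```

Note that `Template` performs plain text substitution and does not validate or escape the IRIs, so this approach assumes the values come from a trusted source such as the earlier ID query.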

Query and function to extract just the triples that IDPCentral uses in its UI.

In [None]:
idpQuery = """
PREFIX schema: <https://schema.org/>
PREFIX idp: <https://example.com/ipd/>
CONSTRUCT {
    ?entry_url idp:name ?entry_name ;
        idp:identifier ?entry_id ;
        idp:sameAs ?uniprot_acc ;
        idp:sequence_range [
            idp:sequence_id ?sequenceID ;
            idp:start ?start ;
            idp:end ?end ;
            idp:range_annotation ?range_annotation
        ] .
}
WHERE {
    GRAPH ?g {
        ?entry_url a schema:Protein ;
            schema:name ?entry_name ;
            schema:identifier ?entry_id ;
            schema:hasSequenceAnnotation ?sequenceID ;
            schema:sameAs ?uniprot_acc .
        FILTER regex(str(?uniprot_acc), "^https://www.uniprot.org/uniprot/")
        ?sequenceID schema:sequenceLocation ?sequenceLocation ;
                  schema:additionalProperty/schema:value/schema:name ?range_annotation .
        ?sequenceLocation schema:rangeStart ?start ;
            schema:rangeEnd ?end.
    }
}
"""
def idpExtraction(g):
    return g.query(idpQuery)
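
The `FILTER` in `idpQuery` is narrower than the one in `idQuery`: it accepts only the `https://www.uniprot.org/uniprot/` form, while `idQuery` also accepts `http://purl.uniprot.org/uniprot/`. The sketch below compares the two patterns using Python's `re` module. It assumes that Python's regex engine behaves like the SPARQL `regex` function for patterns this simple, and the IRIs are illustrative rather than taken from the scraped data.

```python
import re

# Patterns copied from the FILTER clauses of idQuery and idpQuery.
ID_QUERY_PATTERN = r"^(https://www|http://purl).uniprot.org/uniprot/"
IDP_QUERY_PATTERN = r"^https://www.uniprot.org/uniprot/"

def accepted_by(pattern, iri):
    """Return True when the IRI would pass a regex FILTER with this pattern."""
    return re.search(pattern, iri) is not None

# Illustrative IRIs only; the accession is a placeholder.
www_iri = "https://www.uniprot.org/uniprot/P04637"
purl_iri = "http://purl.uniprot.org/uniprot/P04637"

for iri in (www_iri, purl_iri):
    print(iri,
          "idQuery:", accepted_by(ID_QUERY_PATTERN, iri),
          "idpQuery:", accepted_by(IDP_QUERY_PATTERN, iri))
```

Under these assumptions, a protein whose only `schema:sameAs` link to UniProt uses the `purl` form would be found by `idQuery` but would yield no triples from `idpQuery`.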

Function to process every N-Quads file in a specified directory.

In [None]:
def processDataFiles(idpKG, idpModel, directoryLocation):
    processed = 0
    for file in glob(directoryLocation + "*.nq"):
        print("\tProcessing file: %s" % file)
        g = ConjunctiveGraph()
        g.parse(file, format="nquads")
        print("\tSource has %s statements." % len(g))
        # Extract statements for IDPCentral
        idpModel += idpExtraction(g)
        print("\tIDPcentral has %s statements." % len(idpModel))
        # Extract DisProt and UniProt IRIs
        results = g.query(idQuery)
        print("\tID query result has %s rows." % len(results))
        # Convert to IDPCentral model
        for result in results:
            print("\tProtein: %s\n\tUniProt: %s" % (result['proteinIRI'], result['uniprot']))
            resGraph = convertDisprot(g, result['proteinIRI'], result['uniprot'])
            print("\tconvert query has %s statements." % len(resGraph))
            idpKG += resGraph
            print("\tIDPKG has %s statements." % len(idpKG))
        processed += 1
    return processed
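
`processDataFiles` builds its glob pattern by string concatenation, so a `directoryLocation` without a trailing slash (for example `../scraped-data/disprot`) would yield the pattern `../scraped-data/disprot*.nq` and match nothing inside that directory. One possible alternative, sketched below with the hypothetical helper `list_nquads_files`, uses `pathlib` so that both forms behave the same. The temporary directory and file names exist only for the demonstration.

```python
from pathlib import Path
import tempfile

def list_nquads_files(directory_location):
    """Return the .nq files in a directory, sorted, with or without a trailing slash."""
    return sorted(str(p) for p in Path(directory_location).glob("*.nq"))

# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2.nq", "1.nq", "notes.txt"):
        Path(tmp, name).write_text("")
    with_slash = list_nquads_files(tmp + "/")
    without_slash = list_nquads_files(tmp)
    names = [Path(p).name for p in without_slash]

print(names)
```

Sorting the result also makes the processing order reproducible, which `glob` alone does not guarantee.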

Function to write a graph to N-Triples and JSON-LD files.

In [None]:
def outputFiles(graph, label):
    # print(graph.serialize(format='nt'))
    print("%s has %s statements." % (label, len(graph)))
    graph.serialize(label+'.nt', format='nt')
    graph.serialize(label+'.jsonld', format='json-ld')
    print('Successfully written all triples to %s.nt and %s.jsonld' % (label, label))

Read in each N-Quads data file in turn and convert its content into the IDPCentral model.

__Issues:__
- IDPCentral is not being updated with entries from MobiDB

In [None]:
idpKG = Graph()
idpModel = Graph()
totalProcessed = 0

print("Processing DisProt...")
numberOfFiles = processDataFiles(idpKG, idpModel, "../scraped-data/disprot/")
print("Processed %d files" % numberOfFiles)
totalProcessed += numberOfFiles

print("Processing MobiDB...")
numberOfFiles = processDataFiles(idpKG, idpModel, "../scraped-data/mobidb/")
print("Processed %d files" % numberOfFiles)
totalProcessed += numberOfFiles

outputFiles(idpModel, "IDPCentral")
outputFiles(idpKG, "IDPKG")
print('Processed %d files' % totalProcessed)

__Problems with MobiDB data:__ the markup uses some incorrect properties. The cell below runs the pipeline on a single MobiDB file to help diagnose this.

In [None]:
g = ConjunctiveGraph()
g.parse("../scraped-data/mobidb/4283.nq", format="nquads")
# print(g.serialize(format='nt'))
print("Source has %s statements." % len(g))
idpCenGraph = idpExtraction(g)
print("IDPcentral has %s statements." % len(idpCenGraph))
results = g.query(idQuery)
print("ID query result has %s rows." % len(results))
for result in results:
    print("Protein: %s\nUniProt: %s" % result)
resGraph = Graph()
for result in results:
    # Accumulate the converted statements for every protein found
    resGraph += convertDisprot(g, result['proteinIRI'], result['uniprot'])
print("MobiDB has %s statements." % len(resGraph))
outputFiles(resGraph, 'mobidb')