# Converting Scraped Data into IDPCentral

Alasdair J G Gray ([ORCID:0000-0002-5711-4872](http://orcid.org/0000-0002-5711-4872))

_Heriot-Watt University, Edinburgh, UK_

## Introduction

IDPCentral is a proposal for a central registry of proteins that are known to be disordered.

We aim to populate the content of the registry with Bioschemas markup that has been scraped using the BMUSE tool.

This notebook goes through the steps of converting the scraped content into the IDPCentral data model.

## IDPCentral Data Model

The IDPCentral data model reuses ideas from [Wikidata](https://www.mediawiki.org/wiki/Wikibase/DataModel), whereby every statement loaded carries a provenance link recording where it was acquired.

- [ ] Document IDPCentral Model

## Data Sources

The following databases have been scraped to populate IDPCentral:
- [DisProt](https://www.disprot.org/)
- [MobiDB](https://mobidb.bio.unipd.it/)
- [Protein Ensemble Database](https://proteinensemble.org/)

## Python Libraries

We need some support libraries to allow us to interface with a SPARQL endpoint and to process the results. We also need `Graph` from `rdflib` to process the responses from CONSTRUCT queries.

In [None]:
from SPARQLWrapper import SPARQLWrapper, JSON, N3
from rdflib import Graph

In [None]:
from string import Template

## DisProt Conversion

First we configure the SPARQL endpoint where we have loaded the data and set up two functions that query it: `dpSelect` returns the results of a SELECT query as JSON, and `dpConstruct` returns the results of a CONSTRUCT query as an `rdflib` `Graph`.

In [None]:
# SPARQL endpoint holding the scraped DisProt data
dpSparql = SPARQLWrapper("http://localhost:3030/disprot/sparql")

def dpSelect(query):
    """Run a SELECT query and return the results as parsed JSON."""
    dpSparql.setReturnFormat(JSON)
    dpSparql.setQuery(query)
    return dpSparql.queryAndConvert()

def dpConstruct(query):
    """Run a CONSTRUCT query and return the result as an rdflib Graph."""
    dpSparql.setReturnFormat(N3)
    dpSparql.setOnlyConneg(True)
    dpSparql.setQuery(query)
    results = dpSparql.query().convert()
    g = Graph()
    g.parse(data=results, format="n3")
    return g
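
For reference, `dpSelect` hands back a dictionary laid out according to the SPARQL 1.1 Query Results JSON format. The sketch below uses a hand-written sample instead of a live endpoint; the IRIs and the helper `binding_values` are illustrative assumptions and not part of the notebook.

```python
# Hand-written sample in the SPARQL 1.1 Query Results JSON layout.
# The IRIs are placeholders for illustration, not real scraped data.
sample_results = {
    "head": {"vars": ["disprot", "uniprot"]},
    "results": {
        "bindings": [
            {
                "disprot": {"type": "uri", "value": "https://example.org/disprot/DP00001"},
                "uniprot": {"type": "uri", "value": "https://www.uniprot.org/uniprot/P00001"},
            }
        ]
    },
}

def binding_values(results):
    """Flatten each binding into a plain {variable: value} dictionary."""
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in results["results"]["bindings"]
    ]

rows = binding_values(sample_results)
for row in rows:
    print(row["disprot"], "->", row["uniprot"])
```

Each entry in `bindings` maps a variable name to a small dictionary, so the value of interest is always one level down under `"value"`.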

In [None]:
convertQuery = Template("""
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    <${uniprotIRI}> a schema:Protein ;
        schema:identifier ?identifier ;
        schema:name ?name ;
        schema:hasSequenceAnnotation ?annotation ;
        schema:taxonomicRange ?taxonomicRange ;
        schema:hasRepresentation ?representation ;
        schema:sameAs ?sameAs , <${disprotIRI}>.
}
WHERE {
    GRAPH ?g {
# Bioschemas Minimal Properties
        <${disprotIRI}> a schema:Protein ;
            schema:identifier ?identifier ;
            schema:name ?name ;
# Bioschemas Recommended properties
            schema:hasSequenceAnnotation ?annotation ;
            schema:taxonomicRange ?taxonomicRange ;
# Bioschemas Optional properties
            schema:hasRepresentation ?representation ;
            schema:sameAs ?sameAs .
    }
}
""")
def convertDisprot(disprot, uniprot):
    query = convertQuery.substitute(disprotIRI=disprot,uniprotIRI=uniprot)
    #print(query)
    results = dpConstruct(query)
    return results
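
`Template.substitute` fills in each `${...}` placeholder and leaves SPARQL variables such as `?g` untouched, because only `$` is special to `string.Template`. A minimal sketch of that behaviour, using a cut-down query and placeholder IRIs (both are assumptions for illustration):

```python
from string import Template

# Cut-down stand-in for the conversion query; not the full template.
demo_query = Template("""
PREFIX schema: <https://schema.org/>
CONSTRUCT { <${uniprotIRI}> schema:sameAs <${disprotIRI}> . }
WHERE { GRAPH ?g { <${disprotIRI}> a schema:Protein . } }
""")

# Placeholder IRIs, not taken from the scraped data.
filled = demo_query.substitute(
    disprotIRI="https://example.org/disprot/DP00001",
    uniprotIRI="https://www.uniprot.org/uniprot/P00001",
)
print(filled)

# substitute() raises KeyError when a placeholder is left without a value.
try:
    demo_query.substitute(disprotIRI="https://example.org/disprot/DP00001")
    missing_detected = False
except KeyError:
    missing_detected = True
```

Because the IRIs are pasted into the query text, this approach assumes they come from a trusted source and contain no characters that would break out of the `<...>` delimiters.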


In [None]:
query = """
PREFIX schema: <https://schema.org/>
SELECT ?disprot ?uniprot
WHERE {
    GRAPH ?g {
        ?disprot a schema:Protein ;
            schema:sameAs ?uniprot .
        FILTER STRSTARTS(STR(?uniprot), "https://www.uniprot.org/uniprot/")
    }
}
"""
results = dpSelect(query)
for result in results["results"]["bindings"]:
    disprot = result["disprot"]["value"]
    uniprot = result["uniprot"]["value"]
    response = convertDisprot(disprot, uniprot)
    print(response.serialize(format='nt'))
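
The `FILTER` in the query above keeps only those `schema:sameAs` values that point into the UniProt namespace, so each DisProt entry is paired with its UniProt IRI and other cross-references are ignored at this step. The same check, sketched in plain Python with placeholder IRIs (the helper name `is_uniprot_iri` is hypothetical):

```python
UNIPROT_PREFIX = "https://www.uniprot.org/uniprot/"

def is_uniprot_iri(iri):
    """Return True when the IRI lies in the UniProt entry namespace."""
    return iri.startswith(UNIPROT_PREFIX)

# Placeholder sameAs targets, not taken from the scraped data.
same_as = [
    "https://www.uniprot.org/uniprot/P00001",
    "https://example.org/other/12345",
]
uniprot_links = [iri for iri in same_as if is_uniprot_iri(iri)]
print(uniprot_links)
```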

## Conversion using RDFLib

This is an attempt to achieve the same functionality without using a triplestore.

Load in the RDFLib library.

In [None]:
from rdflib import ConjunctiveGraph

Import `Template`, which is used to parameterise the queries.

In [None]:
from string import Template

Import `glob` to list the files in a directory.

In [None]:
from glob import glob

Prepare query to extract UniProt and DisProt IRIs.

In [None]:
idQuery = """
PREFIX schema: <https://schema.org/>
SELECT ?disprot ?uniprot
WHERE {
    GRAPH ?g {
        ?disprot a schema:Protein ;
            schema:sameAs ?uniprot .
        FILTER STRSTARTS(STR(?uniprot), "https://www.uniprot.org/uniprot/")
    }
}
"""

Prepare the query that performs the conversion.

In [None]:
convertQuery = Template("""
PREFIX schema: <https://schema.org/>
CONSTRUCT {
    <${uniprotIRI}> a schema:Protein ;
        schema:identifier ?identifier ;
        schema:name ?name ;
        schema:hasSequenceAnnotation ?annotation ;
        schema:taxonomicRange ?taxonomicRange ;
        schema:hasRepresentation ?representation ;
        schema:sameAs ?sameAs , <${disprotIRI}>.
}
WHERE {
    GRAPH ?g {
# Bioschemas Minimal Properties
        <${disprotIRI}> a schema:Protein ;
            schema:identifier ?identifier ;
            schema:name ?name ;
# Bioschemas Recommended properties
            schema:hasSequenceAnnotation ?annotation ;
            schema:taxonomicRange ?taxonomicRange ;
# Bioschemas Optional properties
            schema:hasRepresentation ?representation ;
            schema:sameAs ?sameAs .
    }
}
""")

Define a function that runs the prepared CONSTRUCT query.

In [None]:
def convertDisprot(disprot, uniprot):
    # Relies on the module-level graph `g`, which the loop below populates for each file
    query = convertQuery.substitute(disprotIRI=disprot, uniprotIRI=uniprot)
    # print(query)
    return g.query(query)

Read in each N-Quads (`.nq`) data file in turn, then process it and convert its contents into the IDPCentral model.

In [None]:
for file in glob("../scraped-data/disprot/*.nq"):
    print("Processing file: %s" % file)
    g = ConjunctiveGraph()
    g.parse(file, format="nquads")
    # Extract DisProt and UniProt IRIs
    results = g.query(idQuery)
    # Convert to IDPCentral model
    for result in results:
        response = convertDisprot(result['disprot'], result['uniprot'])
        print(response.graph.serialize(format='nt'))