This workflow shows how Python can be used to aggregate database identifiers and physicochemical properties, starting from the SMILES line notation of a chemical compound. It first looks up the record for that compound in the VHP4Safety Compound Wiki. Using mappings to Wikidata, it then retrieves additional information.

This workflow was developed as part of the VHP4Safety project by Maastricht University. The tip for using the `wikidataintegrator` came from Andra.

**Installation**

First, we need to install the required software.

In [None]:
!apt install maven openjdk-11-jdk
!pip install scyjava
!pip install wikidataintegrator

In [None]:
from scyjava import config, jimport
from wikidataintegrator import wdi_core

config.endpoints.append('org.openscience.cdk:cdk-bundle:2.9')
SmilesParser = jimport('org.openscience.cdk.smiles.SmilesParser')
Builder = jimport('org.openscience.cdk.silent.SilentChemObjectBuilder')
InChIGeneratorFactory = jimport('org.openscience.cdk.inchi.InChIGeneratorFactory')
INCHI_RET =  jimport('net.sf.jniinchi.INCHI_RET')

**SMILES to InChIKey**

SMILES is a line notation that describes a chemical structure, but it is not a unique identifier: the same structure can be written as many different SMILES strings. Using the SMILES string for a lookup will therefore generally result in false negatives. The InChI and InChIKey were developed for this purpose. We therefore first convert the SMILES into an InChIKey.

In [None]:
sp = SmilesParser(Builder.getInstance())
mol = sp.parseSmiles("CC(=C)[C@H]1CC2=C(O1)C=CC3=C2O[C@@H]4COC5=CC(=C(C=C5[C@@H]4C3=O)OC)OC")
print(f"This compound has {mol.getAtomCount()} atoms.")

factory = InChIGeneratorFactory.getInstance()
generator = factory.getInChIGenerator(mol)
if generator.getReturnStatus() == INCHI_RET.OKAY:
  inchiObj = generator.getInchi()
  print(f"This compound has this InChI: {inchiObj}")
  inchiKey = generator.getInchiKey()
  print(f"  and this InChIKey: {inchiKey}")

This compound has 29 atoms.
This compound has this InChI: InChI=1S/C23H22O6/c1-11(2)16-8-14-15(28-16)6-5-12-22(24)21-13-7-18(25-3)19(26-4)9-17(13)27-10-20(21)29-23(12)14/h5-7,9,16,20-21H,1,8,10H2,2-4H3/t16-,20-,21+/m1/s1
  and this InChIKey: JUVIOZPCNVVQFO-HBGVWJBISA-N
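
Because the InChIKey is pasted into SPARQL query strings in the steps that follow, it can help to check its shape first. The sketch below assumes the standard 27-character layout (14 letters, a hyphen, 10 letters, a hyphen, 1 letter); the name `is_inchikey` is hypothetical and not part of CDK or `wikidataintegrator`.

```python
import re

# Standard InChIKey layout: 14 uppercase letters, hyphen, 10 uppercase
# letters, hyphen, 1 uppercase letter (27 characters in total).
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

def is_inchikey(text):
    """Return True if text has the shape of an InChIKey (hypothetical helper)."""
    # str() also accepts the Java string object returned by the generator.
    return INCHIKEY_PATTERN.fullmatch(str(text)) is not None

print(is_inchikey("JUVIOZPCNVVQFO-HBGVWJBISA-N"))
print(is_inchikey("CC(=C)C"))
```

A check like this only confirms the format; it does not tell whether the key belongs to a known compound.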


**Using the InChIKey to find the VHP4Safety Compound Wiki record**

The code example below uses the calculated InChIKey to do an exact-match lookup with the SPARQL API.

In [None]:
# SPARQL endpoint URLs
compoundwikiEP = "https://compoundcloud.wikibase.cloud/query/sparql"

sparqlquery = '''
PREFIX wd: <https://compoundcloud.wikibase.cloud/entity/>
PREFIX wdt: <https://compoundcloud.wikibase.cloud/prop/direct/>

SELECT ?cmp ?cmpLabel ?inchiKey
WHERE {
  VALUES ?inchiKey { "''' + str(inchiKey) + '''" }
  ?cmp wdt:P10 ?inchiKey .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
'''

df = wdi_core.WDFunctionsEngine.execute_sparql_query(sparqlquery, endpoint=compoundwikiEP, as_dataframe=True)
df.loc[:,["cmp", "cmpLabel", "inchiKey"]]

Unnamed: 0,cmp,cmpLabel,inchiKey
0,https://compoundcloud.wikibase.cloud/entity/Q38,rotenone,JUVIOZPCNVVQFO-HBGVWJBISA-N
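
`wikidataintegrator` hides the HTTP call, but the endpoint speaks the standard SPARQL protocol. As a rough sketch, the same request could be prepared with the standard library alone. The helper name `build_sparql_request` and the User-Agent string are assumptions made for this example, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_sparql_request(endpoint, query):
    """Prepare (but do not send) a SPARQL-protocol GET request."""
    # The query text travels URL-encoded in the 'query' parameter.
    url = endpoint + "?" + urlencode({"query": query})
    headers = {
        # Ask for the standard SPARQL JSON results format.
        "Accept": "application/sparql-results+json",
        # Hypothetical client name, chosen for this example.
        "User-Agent": "compound-lookup-example/0.1",
    }
    return Request(url, headers=headers)

endpoint = "https://compoundcloud.wikibase.cloud/query/sparql"
request = build_sparql_request(endpoint, "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
print(request.full_url)
```

Sending it would be a matter of passing the object to `urllib.request.urlopen` and parsing the response with `json.load`.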


We can extract the Compound Wiki identifier with the following code:

In [None]:
vhpID = df.at[0,'cmp'][44:]
vhpID

'Q38'
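
The offset 44 is the length of the entity namespace `https://compoundcloud.wikibase.cloud/entity/`. A slightly more defensive sketch checks the namespace before stripping it, so that an unexpected IRI raises an error instead of yielding a wrong identifier. `ENTITY_NS` and `entity_id` are hypothetical names introduced for illustration.

```python
ENTITY_NS = "https://compoundcloud.wikibase.cloud/entity/"

def entity_id(iri, namespace=ENTITY_NS):
    """Return the local identifier (e.g. 'Q38') of an entity IRI."""
    if not iri.startswith(namespace):
        raise ValueError(f"{iri!r} is not in namespace {namespace!r}")
    # Everything after the namespace is the identifier.
    return iri[len(namespace):]

print(entity_id("https://compoundcloud.wikibase.cloud/entity/Q38"))  # Q38
```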

**External Databases**

The VHP4Safety Compound Wiki indexes a few databases that are not indexed by other resources or that are otherwise specific to toxicology. The following query looks up the external identifiers.

In [None]:
sparqlquery = '''
PREFIX wd: <https://compoundcloud.wikibase.cloud/entity/>
PREFIX wdt: <https://compoundcloud.wikibase.cloud/prop/direct/>

SELECT ?propertyLabel ?identifier
WHERE {
  VALUES ?property { wd:P4 wd:P13 wd:P19 wd:P22 wd:P23 wd:P26 wd:P27 wd:P28 wd:P36 wd:P41 }
  ?property wikibase:directClaim ?identifierProp .
  OPTIONAL { wd:''' + vhpID + ''' ?identifierProp ?identifier }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
'''

identifiers = wdi_core.WDFunctionsEngine.execute_sparql_query(sparqlquery, endpoint=compoundwikiEP, as_dataframe=True)
identifiers.loc[:,["propertyLabel","identifier"]]

Unnamed: 0,propertyLabel,identifier
0,ToxBank Wiki,https://wiki.toxbank.net/wiki/Rotenone
1,PubChem CID,6758
2,xenobiotic metabolism pathway,WP5486
3,DSSTOX compound identifier,
4,CAS Registry Number,83-79-4
5,JRC Data Catalogue Term,rotenone
6,KEGG ID,
7,ChEBI ID,
8,AOP-Wiki Stressor ID,50
9,ChEMBL ID,
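
The `VALUES` clause is what selects the properties to report, so changing the list of databases only means changing that one line. As an illustration, the clause could be generated from a Python list. `values_clause` is a hypothetical helper, and the property numbers are the ones used in the query above.

```python
def values_clause(variable, property_ids, prefix="wd"):
    """Build a SPARQL VALUES clause such as 'VALUES ?property { wd:P4 wd:P13 }'."""
    terms = " ".join(f"{prefix}:{pid}" for pid in property_ids)
    # Doubled braces produce literal { and } in an f-string.
    return f"VALUES ?{variable} {{ {terms} }}"

external_id_props = ["P4", "P13", "P19", "P22", "P23", "P26", "P27", "P28", "P36", "P41"]
print(values_clause("property", external_id_props))
```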


**Chemical properties**

The VHP4Safety Compound Wiki also provides a few properties. This query fetches those:

In [None]:
sparqlquery = '''
PREFIX wd: <https://compoundcloud.wikibase.cloud/entity/>
PREFIX wdt: <https://compoundcloud.wikibase.cloud/prop/direct/>

SELECT ?propertyLabel ?value
WHERE {
  VALUES ?property { wd:P3 wd:P2 wd:P32 }
  ?property wikibase:directClaim ?valueProp .
  OPTIONAL { wd:''' + vhpID + ''' ?valueProp ?value }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
'''

properties = wdi_core.WDFunctionsEngine.execute_sparql_query(sparqlquery, endpoint=compoundwikiEP, as_dataframe=True)
properties.loc[:,["propertyLabel","value"]]

Unnamed: 0,propertyLabel,value
0,mass,394.4181
1,chemical formula,C₂₃H₂₂O₆
2,octanol-water partition coefficient,
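
The stored formula uses Unicode subscript digits, so it needs a small normalization step before it can be processed further. As a cross-check on the `mass` value, the sketch below computes an average molecular mass from the formula. The atomic weights are abridged standard values hard-coded for this example (an assumption, covering only C, H and O), so the result will differ slightly from the stored value.

```python
import re

# Map Unicode subscript digits (U+2080..U+2089) to ASCII digits.
SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")

# Abridged standard atomic weights; only the elements needed here.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "O": 15.999}

def parse_formula(formula):
    """Return {element: count} for a simple formula without brackets."""
    counts = {}
    plain = formula.translate(SUBSCRIPTS)
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", plain):
        # A missing number means a single atom.
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts

def average_mass(formula):
    """Sum the atomic weights over all atoms in the formula."""
    return sum(ATOMIC_WEIGHT[el] * n for el, n in parse_formula(formula).items())

print(parse_formula("C₂₃H₂₂O₆"))
print(round(average_mass("C₂₃H₂₂O₆"), 3))
```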


**Information from Wikidata**

The VHP4Safety Compound Wiki is linked to Wikidata via `wd:P5`, and Wikidata provides a lot more information, which can be retrieved with its SPARQL API. We first get the Wikidata identifier:

In [None]:
sparqlquery = '''
PREFIX wd: <https://compoundcloud.wikibase.cloud/entity/>
PREFIX wdt: <https://compoundcloud.wikibase.cloud/prop/direct/>

SELECT ?wikidata
WHERE {
  wd:P5 wikibase:directClaim ?identifierProp .
  wd:''' + vhpID + ''' ?identifierProp ?wikidata .
}
'''

identifiers = wdi_core.WDFunctionsEngine.execute_sparql_query(sparqlquery, endpoint=compoundwikiEP, as_dataframe=True)
wikidataID = identifiers.at[0,'wikidata']
wikidataID

'Q412388'
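
The value stored in the Compound Wiki is a bare Wikidata QID. To address the item on Wikidata it has to be prefixed with the Wikidata entity namespace, which is what the `BIND (iri(CONCAT(...)))` line does in the federated queries that follow. In Python the same step might look as follows; `wikidata_iri` is a hypothetical helper.

```python
import re

WIKIDATA_ENTITY_NS = "http://www.wikidata.org/entity/"

def wikidata_iri(qid):
    """Turn a bare QID such as 'Q412388' into a Wikidata entity IRI."""
    # A QID is the letter Q followed by a positive integer.
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a Wikidata QID: {qid!r}")
    return WIKIDATA_ENTITY_NS + qid

print(wikidata_iri("Q412388"))  # http://www.wikidata.org/entity/Q412388
```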

Now that we know the Wikidata identifier, we can use this to get additional information. For example, Wikidata has many more external identifiers, which we can retrieve with SPARQL too:

In [None]:
sparqlquery = '''
PREFIX wd: <https://compoundcloud.wikibase.cloud/entity/>
PREFIX wdt: <https://compoundcloud.wikibase.cloud/prop/direct/>
PREFIX wid: <http://www.wikidata.org/entity/>
PREFIX widt: <http://www.wikidata.org/prop/direct/>

SELECT ?IdentifierLabel ?Value ?url
WHERE {
  wd:P5 wikibase:directClaim ?identifierProp .
  wd:''' + vhpID + ''' ?identifierProp ?wikidata .
  BIND (iri(CONCAT("http://www.wikidata.org/entity/", ?wikidata)) AS ?qid)
  SERVICE <https://query.wikidata.org/sparql> {
    ?qid ?IDdir ?Value .
    ?Identifier wikibase:directClaim ?IDdir ;
            widt:P31 wid:Q19833835 ;
            rdfs:label ?IdentifierLabel .
    FILTER ( lang(?IdentifierLabel) = 'en' )
    OPTIONAL {
      ?Identifier widt:P1630 ?formatterurl .
    }
    FILTER (?Identifier != wid:P233)
    FILTER (?Identifier != wid:P234)
    FILTER (?Identifier != wid:P2017)
    BIND(IRI(REPLACE(?formatterurl, '\\\\$1', str(?Value))) AS ?url).
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
'''

moreDatabases = wdi_core.WDFunctionsEngine.execute_sparql_query(sparqlquery, endpoint=compoundwikiEP, as_dataframe=True)
moreDatabases.loc[:,["IdentifierLabel","Value","url"]]

Unnamed: 0,IdentifierLabel,Value,url
0,ChEMBL ID,CHEMBL429023,https://www.ebi.ac.uk/chembl/compound_report_c...
1,UNII,03L9OT429T,https://gsrs.ncats.nih.gov/ginas/app/beta/subs...
2,RTECS number,DJ2800000,
3,KEGG ID,C07593,https://www.kegg.jp/entry/C07593
4,ChEBI ID,28201,https://www.ebi.ac.uk/chebi/searchId.do?chebiI...
5,ChemSpider ID,6500,https://www.chemspider.com/Chemical-Structure....
6,PubChem CID,6758,https://pubchem.ncbi.nlm.nih.gov/compound/6758
7,CAS Registry Number,83-79-4,https://commonchemistry.cas.org/detail?cas_rn=...
8,EC number,201-501-9,https://echa.europa.eu/information-on-chemical...
9,InChIKey,JUVIOZPCNVVQFO-HBGVWJBISA-N,https://www.ncbi.nlm.nih.gov/sites/entrez?cmd=...
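
The `url` column is produced by substituting each identifier into the formatter URL of its property, where `$1` marks the place of the identifier. The pattern is escaped twice in the query: the Python string `'\\\\$1'` becomes `'\\$1'` in SPARQL, which in turn is the regular expression `\$1`, a literal dollar sign followed by `1`. A plain-Python equivalent, as a sketch with the hypothetical helper `expand_formatter`:

```python
def expand_formatter(formatter_url, value):
    """Insert an identifier into a formatter URL at the '$1' placeholder."""
    if formatter_url is None:
        # No formatter URL available, so no link can be built.
        return None
    # str.replace works on literal text, so '$' needs no escaping here.
    return formatter_url.replace("$1", str(value))

# Formatter patterns inferred from the URLs in the result table above.
print(expand_formatter("https://www.kegg.jp/entry/$1", "C07593"))
print(expand_formatter("https://pubchem.ncbi.nlm.nih.gov/compound/$1", 6758))
print(expand_formatter(None, "DJ2800000"))
```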


Likewise, we can request physicochemical properties:

In [None]:
sparqlquery = '''
PREFIX wd: <https://compoundcloud.wikibase.cloud/entity/>
PREFIX wdt: <https://compoundcloud.wikibase.cloud/prop/direct/>
PREFIX wid: <http://www.wikidata.org/entity/>
PREFIX widt: <http://www.wikidata.org/prop/direct/>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?propEntityLabel ?value ?unitsLabel ?source ?doi
WHERE {
  wd:P5 wikibase:directClaim ?identifierProp .
  wd:''' + vhpID + ''' ?identifierProp ?wikidata .
  BIND (iri(CONCAT("http://www.wikidata.org/entity/", ?wikidata)) AS ?qid)
  SERVICE <https://query.wikidata.org/sparql> {
    ?qid ?propp ?statement .
    ?statement a wikibase:BestRank ;
      ?proppsv [
        wikibase:quantityAmount ?value ;
        wikibase:quantityUnit ?units
      ] .
    OPTIONAL {
      ?statement prov:wasDerivedFrom/pr:P248 ?source .
      OPTIONAL { ?source widt:P356 ?doi . }
    }
    ?property wikibase:claim ?propp ;
            wikibase:statementValue ?proppsv ;
            widt:P1629 ?propEntity ;
            widt:P31 wid:Q21077852 .
    ?propEntity rdfs:label ?propEntityLabel .
    FILTER ( lang(?propEntityLabel) = 'en' )
    ?units rdfs:label ?unitsLabel .
    FILTER ( lang(?unitsLabel) = 'en' )
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
'''

moreProperties = wdi_core.WDFunctionsEngine.execute_sparql_query(sparqlquery, endpoint=compoundwikiEP, as_dataframe=True)
moreProperties.loc[:,["propEntityLabel","value","unitsLabel","source"]]

Unnamed: 0,propEntityLabel,value,unitsLabel,source
0,mass density,1.27,gram per cubic centimetre,
1,melting point,330.0,degree Fahrenheit,
2,Immediately dangerous to life or health,2500.0,milligram per cubic metre,
3,time-weighted average concentration,5.0,milligram per cubic metre,
4,vapor pressure,4e-05,millimetre of mercury,
5,boiling point,215.0,degree Celsius,http://www.wikidata.org/entity/Q328
6,mass,394.142,dalton,http://www.wikidata.org/entity/Q278487
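
Wikidata reports each quantity in the unit it was entered with, which is why the melting point above is in degrees Fahrenheit while the boiling point is in degrees Celsius. Before comparing values it may be useful to normalize them. The sketch below covers only some of the units that occur in the table; the `CONVERTERS` table and the `normalize` function are hypothetical, and the target units are a choice made for this example.

```python
# Map a Wikidata unit label to (target unit label, conversion function).
CONVERTERS = {
    "degree Fahrenheit": ("degree Celsius", lambda v: (v - 32.0) * 5.0 / 9.0),
    "degree Celsius": ("degree Celsius", lambda v: v),
    "millimetre of mercury": ("pascal", lambda v: v * 133.322387415),
    "gram per cubic centimetre": ("kilogram per cubic metre", lambda v: v * 1000.0),
}

def normalize(value, unit_label):
    """Convert a value to its target unit; unknown units pass through unchanged."""
    target, convert = CONVERTERS.get(unit_label, (unit_label, lambda v: v))
    return convert(value), target

samples = [
    (330.0, "degree Fahrenheit"),
    (4e-05, "millimetre of mercury"),
    (1.27, "gram per cubic centimetre"),
]
for value, unit in samples:
    print(normalize(value, unit))
```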


The last command may fail, depending on the available data. The SPARQL query does select a `?doi` variable, but pandas seems to drop columns that are empty, so the "doi" column may or may not be present, depending on the data.
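
One way around the missing column is to build the table from the raw SPARQL JSON result, using the variable list in `head.vars` so that every selected variable gets a column even when it is unbound in all rows. The sketch assumes the standard SPARQL 1.1 JSON results layout; `bindings_to_rows` is a hypothetical helper and the small sample result is written by hand for illustration.

```python
def bindings_to_rows(result):
    """Flatten a SPARQL JSON result into a list of dicts, one key per variable."""
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        # Unbound variables are absent from a binding; fill them with None.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows

# Hand-written sample in the standard layout; ?doi is unbound in the only row.
sample = {
    "head": {"vars": ["propEntityLabel", "value", "doi"]},
    "results": {"bindings": [
        {"propEntityLabel": {"type": "literal", "value": "boiling point"},
         "value": {"type": "literal", "value": "215.0"}},
    ]},
}
rows = bindings_to_rows(sample)
print(rows)
```

A list of dictionaries like `rows` can then be turned into a dataframe, where the `doi` column is present and simply empty.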