**idNumericalDuplication.ipynb**

Duplicate use of numerical component of EDAM concept ID.

**Documentation:** https://github.com/edamontology/edamverify/blob/master/docs/iDNumericalDuplication.md

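Note: The script may report errors where the numerical ID component is, in fact, *not* duplicated, for example where subsets have been used erroneously (the query returns one row per concept and subset, so a concept assigned to two subsets is compared against itself). For this reason, subsets are included in the diagnostic output.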
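To make the false-positive case above concrete, here is a minimal sketch (not part of the notebook's own code) that separates the two situations: different concept IRIs sharing one numerical component, and a single concept listed in more than one subset. The function name `find_numeric_id_clashes` and the `(iri, subset)` input shape are assumptions for illustration, and the `topic_0001` / `data_0001` IRIs are made up; only `operation_3456` is taken from the output further down.

```python
from collections import defaultdict

def find_numeric_id_clashes(rows):
    """Split suspected duplicates into two groups.

    rows: iterable of (iri, subset) pairs (hypothetical input shape).
    Returns (true_clashes, multi_subset) where
      true_clashes maps a numerical ID to the distinct IRIs that share it,
      multi_subset maps an IRI to the subsets it appears in (if more than one).
    """
    iris_by_number = defaultdict(set)
    subsets_by_iri = defaultdict(set)
    for iri, subset in rows:
        number = iri.rpartition("_")[2]  # text after the last underscore
        iris_by_number[number].add(iri)
        subsets_by_iri[iri].add(subset)
    true_clashes = {n: sorted(i) for n, i in iris_by_number.items() if len(i) > 1}
    multi_subset = {i: sorted(s) for i, s in subsets_by_iri.items() if len(s) > 1}
    return true_clashes, multi_subset

OBO = "http://purl.obolibrary.org/obo/edam#"
rows = [
    # Same concept in two subsets: not a real duplication of the numerical ID.
    ("http://edamontology.org/operation_3456", OBO + "operations"),
    ("http://edamontology.org/operation_3456", OBO + "data"),
    # Made-up IRIs: two different concepts sharing the number 0001.
    ("http://edamontology.org/topic_0001", OBO + "topics"),
    ("http://edamontology.org/data_0001", OBO + "data"),
]

true_clashes, multi_subset = find_numeric_id_clashes(rows)
print(true_clashes)
print(multi_subset)
```

Reporting the two groups separately would let a reader tell a subset annotation problem from a genuine ID clash without inspecting each line of the diagnostic output.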

Set constants for script return values. Load EDAM_dev.owl from GitHub into an RDF graph.

In [5]:
import sys
from rdflib import ConjunctiveGraph, Namespace

# Constants for script error reporting as per https://github.com/edamontology/edamverify.
NOERR = "NOERR"
INFO  = "INFO"
WARN  = "WARN"
ERROR = "ERROR"



# Load EDAM_dev.owl from GitHub into an RDF graph.
# Note: Graph.load() was removed in rdflib 6; parse() works in old and new versions.
print("Loading graph ...", end="")
g = ConjunctiveGraph()
g.parse('https://raw.githubusercontent.com/edamontology/edamontology/master/EDAM_dev.owl', format='xml')
# g.parse('EDAM_dev.owl', format='xml')
g.bind('edam', Namespace('http://edamontology.org#'))
print("done!")


Loading graph ...done!


Define SPARQL query to extract ID, term and subset of all concepts. Run the query.

**NB:** EDAM subsets are defined using XML *attributes* in the EDAM.OWL (RDF/XML) file, so, as far as I'm aware, their *literal values* cannot be retrieved by SPARQL query (e.g. for purposes of filtering). The filtering is therefore done in the Python code that follows this block.

In [6]:
# Compile SPARQL query
query_term = """
SELECT ?id ?term ?subset WHERE
{
?id rdfs:label ?term .
?id oboInOwl:inSubset ?subset . 
}
"""

# Declare hash tables for results
ids = {}
numerical_ids = {}
terms = {}
subsets = {}

# Run SPARQL query and collate results
errfound = False    
report = list()
results = g.query(query_term)

Analyse results of query.

**NOTE**
1. The numerical component of a concept ID is taken to be everything after the last occurrence of the underscore ('_') character.
2. Concepts which are not defined to be in one of the "topics", "operations", "data" or "formats" subsets are ignored (not checked).
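The two rules above can be sketched in isolation as follows. This is an illustration only, mirroring the `rfind` call and the subset comparison used in the analysis cell; the helper names `numerical_component` and `is_checked` and the constant `RELEVANT_SUBSETS` are assumptions, and the `data_type_0042` IRI is made up to show why splitting at the last underscore matters.

```python
# Subset IRIs that the analysis cell checks; all others are skipped.
RELEVANT_SUBSETS = frozenset({
    "http://purl.obolibrary.org/obo/edam#topics",
    "http://purl.obolibrary.org/obo/edam#operations",
    "http://purl.obolibrary.org/obo/edam#data",
    "http://purl.obolibrary.org/obo/edam#formats",
})

def numerical_component(iri):
    """Return the text after the last underscore (the whole string if there is none)."""
    return iri[iri.rfind("_") + 1:]

def is_checked(subset):
    """True if the concept's subset is one of the four that are checked."""
    return subset in RELEVANT_SUBSETS

print(numerical_component("http://edamontology.org/operation_3456"))
# Made-up IRI with two underscores: only the trailing part is kept.
print(numerical_component("http://edamontology.org/data_type_0042"))
print(is_checked("http://purl.obolibrary.org/obo/edam#formats"))
```

A set membership test reads more compactly than a chain of `!=` comparisons, but it selects the same concepts.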

In [7]:
report.append("Suspected duplication of the numerical component of the concept ID for these concepts:")

for r in results :
#    print(str(r['id']), str(r['term']), str(r['subset']))
    
    id     = str(r['id'])
    term   = str(r['term']) 
    subset = str(r['subset']) 
    
    # Discard concepts in irrelevant subsets
    if subset != "http://purl.obolibrary.org/obo/edam#topics" \
            and subset != "http://purl.obolibrary.org/obo/edam#operations" \
            and subset != "http://purl.obolibrary.org/obo/edam#data" \
            and subset != "http://purl.obolibrary.org/obo/edam#formats":
        continue
        
    # Extract the numerical component: everything after the last underscore
    pos = id.rfind("_")
    numerical_id = id[pos+1:]
    
    # Check for duplicate numerical ID
    if numerical_id in numerical_ids:
        # Suspected duplicate found
        errfound = True
        report.append(id +  ' (' + term + ')' + " in subset:" + subset 
                      + " ::: " 
                      + ids[numerical_id] + ' (' + terms[numerical_id] + ')' + " in subset:" + subsets[numerical_id])
    else:
        # No duplicate found
        numerical_ids[numerical_id] = True
        ids[numerical_id] = id
        terms[numerical_id] = term
        subsets[numerical_id] = subset

Write report and return appropriate value.
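The result line printed by the next cell looks like JSON but is assembled by hand, so it has no enclosing braces and any newline in the report ends up unescaped inside the quoted string. If strictly valid JSON were wanted (an assumption; the format expected by edamverify is not stated here), a minimal sketch using the standard `json` module could look like this. The function name `format_result` is hypothetical; the key names are the ones used in the cell.

```python
import json

def format_result(test_name, status, report_lines):
    """Serialise one test result as a single-line JSON object."""
    reason = "\n".join(report_lines) if report_lines else "-"
    # json.dumps escapes quotes and newlines inside the strings.
    return json.dumps({"Test name": test_name, "Status": status, "Reason": reason})

line = format_result(
    "idNumericalDuplication",
    "ERROR",
    ["Suspected duplication of the numerical component of the concept ID for these concepts:",
     "http://edamontology.org/operation_3456 (Rigid body refinement)"],
)
print(line)
```

The output can then be read back with `json.loads`, which restores the original multi-line report text.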

In [8]:
# Print the test result. The sys.exit() calls are left commented out (sys.exit() raises SystemExit).
if errfound:
    print('"Test name": ' + '"idNumericalDuplication", ' +\
          '"Status": ' + '"' + ERROR + '", ' +\
          '"Reason": ' + '"' + '\n'.join(report) + '"')
    # print("\n".join(report))
    # sys.exit(ERROR)
else:
    print('"Test name": ' + '"idNumericalDuplication", ' +\
          '"Status": ' + '"' + NOERR + '", ' +\
          '"Reason": ' + '"-"')
    # print("No issues found.")
    # sys.exit(NOERR)
    

"Test name": "idNumericalDuplication", "Status": "ERROR", "Reason": "Suspected duplication of the numerical component of the concept ID for these concepts:
http://edamontology.org/operation_3456 (Rigid body refinement) in subset:http://purl.obolibrary.org/obo/edam#operations ::: http://edamontology.org/operation_3456 (Rigid body refinement) in subset:http://purl.obolibrary.org/obo/edam#data
http://edamontology.org/operation_3455 (Molecular replacement) in subset:http://purl.obolibrary.org/obo/edam#operations ::: http://edamontology.org/operation_3455 (Molecular replacement) in subset:http://purl.obolibrary.org/obo/edam#data
http://edamontology.org/operation_3454 (Phasing) in subset:http://purl.obolibrary.org/obo/edam#data ::: http://edamontology.org/operation_3454 (Phasing) in subset:http://purl.obolibrary.org/obo/edam#operations
http://edamontology.org/format_3873 (HDF) in subset:http://purl.obolibrary.org/obo/edam#data ::: http://edamontology.org/format_3873 (HDF) in subset:http://pu