**wikipediaLinkMissing.ipynb**

Wikipedia link missing from EDAM Topic concept.

**Documentation:** https://github.com/edamontology/edamverify/blob/master/docs/wikiepediaLinkMissing.md

**NB.1:** Wikipedia links may be specified in the following ways:
* ``<rdfs:seeAlso>http://en.wikipedia.org/wiki/List_of_file_formats</rdfs:seeAlso>``        
* ``<rdfs:seeAlso rdf:resource="https://en.wikipedia.org/wiki/Information_Hyperlinked_over_Proteins"/>``       
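The two forms above put the URL in different places: the first as element text, the second in an ``rdf:resource`` attribute. As a minimal, standard-library-only sketch (independent of the rdflib code used in this notebook), the snippet below reads both forms from a small hand-written RDF/XML fragment. The ``example.org`` subjects and the ``see_also_values`` helper are assumptions made for illustration; they are not part of EDAM.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Hand-written fragment containing one rdfs:seeAlso of each form (illustrative only).
SNIPPET = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}">
  <rdf:Description rdf:about="http://example.org/concept_a">
    <rdfs:seeAlso>http://en.wikipedia.org/wiki/List_of_file_formats</rdfs:seeAlso>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/concept_b">
    <rdfs:seeAlso rdf:resource="https://en.wikipedia.org/wiki/Information_Hyperlinked_over_Proteins"/>
  </rdf:Description>
</rdf:RDF>"""


def see_also_values(xml_text):
    """Return every rdfs:seeAlso value, whether given as text or as rdf:resource."""
    root = ET.fromstring(xml_text)
    values = []
    for elem in root.iter(f"{{{RDFS}}}seeAlso"):
        resource = elem.get(f"{{{RDF}}}resource")
        text = (elem.text or "").strip()
        values.append(resource if resource is not None else text)
    return values


links = see_also_values(SNIPPET)
has_wikipedia = ["wikipedia" in value for value in links]
print(links)
print(has_wikipedia)
```

Either way the value ends up as a string containing ``wikipedia``, which is what the substring check later in this notebook relies on.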


**NB.2 (running the notebook):** The directory containing the ``EDAM_dev.owl`` file must be given by the ``EDAM_PATH`` environment variable.

The test must be run from a subdirectory of ``EDAM_PATH`` (hence the fallback ``'../EDAM_dev.owl'`` below).
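
The note above describes ``EDAM_PATH`` as a directory, while the loading cell below passes the variable's value directly to the loader. A hedged sketch of one way to accept either a directory or a full file path is shown here. The helper name ``resolve_owl_path`` and its ``env`` parameter are hypothetical, introduced only so the logic can be exercised without touching the real environment.

```python
import os


def resolve_owl_path(env=None, filename="EDAM_dev.owl"):
    """Return a path to the OWL file based on EDAM_PATH (hypothetical helper)."""
    env = os.environ if env is None else env
    base = env.get("EDAM_PATH")
    if base is None:
        # Same fallback as the notebook: the file sits one directory up.
        return os.path.join("..", filename)
    if base.endswith(".owl"):
        # EDAM_PATH already names the file itself.
        return base
    # EDAM_PATH names the directory containing the file.
    return os.path.join(base, filename)


print(resolve_owl_path({}))
print(resolve_owl_path({"EDAM_PATH": "/data/edam"}))
print(resolve_owl_path({"EDAM_PATH": "/data/edam/EDAM_dev.owl"}))
```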

Set constants for script return values. Load ``EDAM_dev.owl`` into an RDF graph, from the local path by default (loading from GitHub is kept as a commented-out alternative).

In [28]:
import os
from rdflib import ConjunctiveGraph, Namespace
import json

# Constants for script error reporting as per https://github.com/edamontology/edamverify.
NOERR = "NOERR"
INFO  = "INFO"
WARN  = "WARN"
ERROR = "ERROR"

# Load EDAM_dev.owl into an RDF graph (local path by default; GitHub URL kept as an alternative).
print("Loading graph ...", end="")
g = ConjunctiveGraph()
# parse() is used here because newer rdflib releases no longer provide Graph.load().
g.parse(os.environ.get('EDAM_PATH', '../EDAM_dev.owl'), format='xml')
# g.parse('https://raw.githubusercontent.com/edamontology/edamontology/master/EDAM_dev.owl', format='xml')
# g.parse('EDAM_dev.owl', format='xml')
g.bind('edam', Namespace('http://edamontology.org#'))
print("done!")

Loading graph ...done!


Define SPARQL query to extract ID, term, and (if available) seealso and deprecated fields of all Topic concepts. Run the query.

**NB:** Use ``"/topic_"`` in query to avoid detection of http://edamontology.org/is_topic_of
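
To see why the leading slash matters, here is a small sketch using Python's ``re`` module as a stand-in for SPARQL's ``regex()``, which likewise matches anywhere in the string. The two IRIs are the ones mentioned in this notebook; the variable names are illustrative.

```python
import re

candidates = [
    "http://edamontology.org/topic_3339",   # a Topic concept
    "http://edamontology.org/is_topic_of",  # a property, not a Topic concept
]

# With the slash, only IRIs whose local name starts with "topic_" match.
with_slash = [iri for iri in candidates if re.search("/topic_", iri)]

# Without it, "is_topic_of" also matches, because it contains "topic_".
without_slash = [iri for iri in candidates if re.search("topic_", iri)]

print(with_slash)
print(without_slash)
```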

In [29]:
# Define SPARQL query
query_term = """
SELECT ?id ?term ?seealso ?deprecated WHERE
{
?id rdfs:label ?term .
OPTIONAL {?id rdfs:seeAlso ?seealso .}
OPTIONAL {?id owl:deprecated ?deprecated .}
FILTER regex(str(?id), "/topic_")
}
"""
# Declare hash tables for results
ids = {}
terms = {}
errs = {}

# Initialise the error flag and report, then run the SPARQL query
errfound = False
report = []
results = g.query(query_term)

Analyse results of query.
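
Because ``rdfs:seeAlso`` is matched with ``OPTIONAL``, a concept with several ``seeAlso`` values comes back as several rows, and a concept with none comes back with an unbound value. The analysis therefore has to aggregate per concept rather than judge each row on its own. A minimal stand-alone sketch of that aggregation follows; the rows are made up for illustration (they are not real EDAM data), and ``concepts_missing_wikipedia`` is a hypothetical helper name.

```python
# Hypothetical rows in the shape (id, term, seealso, deprecated).
rows = [
    ("topic_x", "Concept X", "http://example.org/x", None),
    ("topic_x", "Concept X", "https://en.wikipedia.org/wiki/X", None),
    ("topic_y", "Concept Y", "http://example.org/y", None),
    ("topic_z", "Concept Z", None, None),    # no seeAlso at all
    ("topic_w", "Concept W", None, "true"),  # deprecated, so ignored
]


def concepts_missing_wikipedia(rows):
    """Return sorted ids of non-deprecated concepts with no Wikipedia seeAlso."""
    has_link = {}
    for concept_id, _term, seealso, deprecated in rows:
        if deprecated == "true":
            continue
        found = seealso is not None and "wikipedia" in seealso
        # One matching row is enough to clear the concept.
        has_link[concept_id] = has_link.get(concept_id, False) or found
    return sorted(cid for cid, ok in has_link.items() if not ok)


missing = concepts_missing_wikipedia(rows)
print(missing)
```

``topic_x`` is cleared by its second row even though its first row has no Wikipedia URL, which is the case the per-concept flag is meant to handle.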

In [30]:
for r in results :
    
    id      = str(r['id'])
    term    = str(r['term'])
    seealso = str(r['seealso']) 
    deprecated = str(r['deprecated'])

    # Skip deprecated concepts
    if deprecated == "true":
        continue
            
    # print(id, "(", term, ")   ", seealso, "   ", str(r['seealso']))

    # id is assigned to both the key and value of the 'ids' hash table
    # Later on, just the key is used
    ids[id] = id
    terms[id] = term

    # Concepts can have more than one seeAlso property, not all of which include
    # a Wikipedia URL - deal with that.
    if id not in errs:
        errs[id] = True

    # str() turns an unbound (missing) query value into the string "None"
    if seealso != "None" and "wikipedia" in seealso:
        errs[id] = False
    
for key in ids:
    if errs[key]:
        errfound = True
        report.append("Missing wikipedia link ::: " + key +  ' (' + terms[key] + ')')

Write report and return appropriate value.

In [31]:
report_obj = {}
report_obj['test_name'] = 'wikipediaLinkMissing'
report_obj['comment'] = 'Missing Wikipedia link for one or more Topic concepts.'

if errfound:
    report_obj['status'] = INFO
    report_obj['reason'] = report
else:
    report_obj['status'] = NOERR

report_json = json.dumps(report_obj, indent=4)
print(report_json)


{
    "test_name": "wikipediaLinkMissing",
    "reason": [
        "Missing wikipedia link ::: http://edamontology.org/topic_3339 (Microbial collection)",
        "Missing wikipedia link ::: http://edamontology.org/topic_3524 (Simulation experiment)",
        "Missing wikipedia link ::: http://edamontology.org/topic_3374 (Biotherapeutics)",
        "Missing wikipedia link ::: http://edamontology.org/topic_3511 (Nucleic acid sites, features and motifs)",
        "Missing wikipedia link ::: http://edamontology.org/topic_0632 (Probes and primers)",
        "Missing wikipedia link ::: http://edamontology.org/topic_3393 (Quality affairs)",
        "Missing wikipedia link ::: http://edamontology.org/topic_0736 (Protein folds and structural domains)",
        "Missing wikipedia link ::: http://edamontology.org/topic_0160 (Sequence sites, features and motifs)",
        "Missing wikipedia link ::: http://edamontology.org/topic_1775 (Function analysis)",
        "Missing wikipedia link ::: http: