The actual Jupyter Notebook can be found [here](https://github.com/biothings/JSON-LD_BioThings_API_DEMO/blob/master/src/Demo%20for%20Data%20Discrepancy%20Check.ipynb).

### This code demonstrates how a data discrepancy check is done using JSON-LD

### Requirements

1. Install the Python package biothings_client. **biothings_client** is an easy-to-use Python wrapper for accessing any Biothings.api-based backend service, including MyGene.info and MyVariant.info. It can be downloaded from [pypi](https://pypi.python.org/pypi/biothings-client/0.1.1) or installed using **'pip install biothings_client'**. This demo only uses the **biothings_client** functions related to **MyVariant.info**.
2. Clone the demo repo and run the code under the **'src'** folder. The **JSON-LD_BioThings_API_DEMO** repo stores all the code used for the paper and can be found at [github](https://github.com/biothings/JSON-LD_BioThings_API_DEMO). This demo uses the Python module **'jsonld_processor'**.

**jsonld_processor** is a collection of JSON-LD related functions. It can be found in the [repo](https://github.com/biothings/JSON-LD_BioThings_API_DEMO/blob/master/src/jsonld_processor.py). The functions used in this demo include **nquads_transform**, which takes a JSON-LD doc and transforms it into N-Quads format, and **fetch_value_by_uri**, which takes a URI (e.g. "http://identifiers.org/dbsnp/", the URI for rsid) and returns all values in the JSON-LD doc corresponding to that URI.

The output of this code is the list of all HGVS IDs with rsid discrepancy issues, i.e. variants for which different data sources report different rsids.
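To make the idea concrete, here is a minimal, self-contained sketch of the kind of lookup the discrepancy check relies on. It does not use jsonld_processor or a real JSON-LD library: the `FIELD_TO_URI` mapping and the `values_by_uri` and `has_discrepancy` helpers are hypothetical stand-ins for the context file and for fetch_value_by_uri, and the sample doc reuses the values reported for `chr15:g.28228629C>A` in this demo's output.

```python
# Hypothetical stand-in for a JSON-LD context: each flattened field name
# is mapped to the URI that describes what the field means.
RSID_URI = "http://identifiers.org/dbsnp/"
FIELD_TO_URI = {
    "dbsnp.rsid": RSID_URI,
    "dbnsfp.rsid": RSID_URI,
    "mutdb.rsid": RSID_URI,
}


def values_by_uri(flat_doc, uri, field_to_uri=FIELD_TO_URI):
    """Return the sorted distinct values of every field mapped to `uri`."""
    found = set()
    for field, value in flat_doc.items():
        if field_to_uri.get(field) != uri:
            continue
        # a field may hold a single value or a list of values
        values = value if isinstance(value, list) else [value]
        found.update(values)
    return sorted(found)


def has_discrepancy(flat_doc, uri):
    """True when the fields sharing `uri` disagree on the value."""
    return len(values_by_uri(flat_doc, uri)) > 1


sample = {
    "_id": "chr15:g.28228629C>A",
    "dbsnp.rsid": "rs778045887",
    "dbnsfp.rsid": "rs778045887",
    "mutdb.rsid": "rs147218966",
}
rsids = values_by_uri(sample, RSID_URI)
print(rsids)
print(has_discrepancy(sample, RSID_URI))
```

Because the comparison is keyed on the shared URI rather than on field names, a new source only needs an entry in the mapping to take part in the check.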

In [1]:
from biothings_client import get_client
from jsonld_processor import nquads_transform, fetch_value_by_uri, load_context, flatten_doc
import csv

In [2]:
# counts how many discrepant ids have been printed so far
test_print = 0

In [3]:
############################################################################
# Note that looping through all docs in MyVariant.info takes a long
# time, so for demo purposes we limit the scan to the first 200,000 docs.
# Change the value of total_docs to scan more docs. Every flagged
# hgvs_id is written to the output csv file 'rsid_discrepancy_check.csv'.
############################################################################
total_docs = 200000
with open('rsid_discrepancy_check.csv', 'w') as csvfile:
    # count the total number of docs scanned
    cnt = 0
    # json-ld context file for MyVariant.info
    context = load_context('myvariant.info')
    # write the header for csv file
    fieldnames = ['hgvs_id']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    # get all docs in MyVariant.info
    mv = get_client('variant')
    data = mv.query(q='__all__', fetch_all=True)
    # loop through each doc, apply jsonld context 
    for doc in data:
        cnt += 1
        if cnt % 50000 == 0:
            print('{} docs have been scanned'.format(cnt))
        # only these sources contain rsid info, so only apply json-ld when at least one of them appears;
        # any() tests each source name separately
        if any(src in doc for src in ('dbnsfp', 'gwassnps', 'mutdb', 'clinvar', 'dbsnp', 'evs', 'grasp')):
            try:
                doc = flatten_doc(doc)
                doc.update(context)
                nquads_doc = nquads_transform(doc)
                rsid = fetch_value_by_uri(nquads_doc, "http://identifiers.org/dbsnp/")
                # a list result is treated as a discrepancy: more than one rsid was found
                if rsid and isinstance(rsid, list):
                    writer.writerow({'hgvs_id': doc['_id']})
                    # only print the first 10 docs having rsid discrepancy issue
                    if test_print < 10:
                        test_print += 1
                        rsid_dic = {}
                        for _key in ['dbsnp.rsid', 'dbnsfp.rsid', 'clinvar.rsid', 'gwassnps.rsid', 'mutdb.rsid', 'evs.rsid', 'grasp.rsid']:
                            if _key in doc:
                                rsid_dic[_key.split('.')[0]] = doc[_key]
                        print_message = doc['_id'] + ': '
                        for k, v in rsid_dic.items():
                            # format() also handles sources that report a list of rsids
                            print_message += '{} reports {}; '.format(k, v)
                        print(print_message)
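            except Exception:
                # skip docs that fail json-ld processing
                #print('error id {}'.format(doc['_id']))
                pass
        # stop once the demo limit is reached, whether or not this doc was processed
        if cnt >= total_docs:
            break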

Fetching 424519520 variant(s) . . .
chr15:g.28228629C>A: dbsnp reports rs778045887; dbnsfp reports rs778045887; mutdb reports rs147218966; 
chr8:g.12043908A>G: dbsnp reports rs201884366; dbnsfp reports rs201884366; mutdb reports rs2409919; 
50000 docs have been scanned
chr11:g.5246838T>A: dbsnp reports rs33996892; clinvar reports rs33996892; dbnsfp reports rs33996892; mutdb reports rs121909829; 
100000 docs have been scanned
150000 docs have been scanned
chr10:g.17145142G>C: dbsnp reports rs2228053; dbnsfp reports rs2228053; evs reports rs2228053; mutdb reports rs149812870; 
chrX:g.12712508G>A: dbsnp reports rs779596855; dbnsfp reports rs779596855; mutdb reports rs148666498; 
200000 docs have been scanned
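
Every flagged id is also written to 'rsid_discrepancy_check.csv'. The following is a sketch for reading that file back, assuming it has the single 'hgvs_id' column written by the demo. `read_hgvs_ids` is a hypothetical helper, and the small sample file created here exists only to keep the example self-contained.

```python
import csv
import os
import tempfile


def read_hgvs_ids(path):
    """Return the values of the 'hgvs_id' column, in file order."""
    with open(path, newline="") as csvfile:
        return [row["hgvs_id"] for row in csv.DictReader(csvfile)]


# Write a tiny illustrative file with the same header the demo uses.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "rsid_discrepancy_check.csv")
with open(sample_path, "w", newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=["hgvs_id"])
    writer.writeheader()
    writer.writerow({"hgvs_id": "chr15:g.28228629C>A"})
    writer.writerow({"hgvs_id": "chr8:g.12043908A>G"})

ids = read_hgvs_ids(sample_path)
print(len(ids), "ids flagged")
```

To inspect real results, point `read_hgvs_ids` at the file produced by the demo instead of the sample path.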
