# GA4GH IPython Example Notebook

This notebook provides an overview of how to call the GA4GH reference server from an IPython notebook. Before running this notebook:
1. Install the ga4gh library.
2. Launch an instance of the reference server on localhost by running "python server_dev.py".

## Connect to GA4GH Server

In [1]:
import ga4gh.client
baseURL = "http://localhost:8000"
client = ga4gh.client.HttpClient(baseURL)

Great! Now we can query ReferenceSets, References, Datasets, VariantSets, CallSets, Variants, ReadGroupSets, ReadGroups, and Reads.

## Search/Get ReferenceSets

In [2]:
results = client.searchReferenceSets()
for result in results:
    print result.toJsonDict()
    referenceSetId = result.toJsonDict()['id']

{u'description': u'TODO', u'sourceURI': u'TODO', u'assemblyId': u'TODO', u'sourceAccessions': [], u'md5checksum': u'234ea63052f999c21c2bdb6f60e61038', u'isDerived': False, u'id': u'R1JDaDM4LXN1YnNldA==', u'ncbiTaxonId': 9606, u'name': None}


In [3]:
referenceSet = client.getReferenceSet(referenceSetId)
print referenceSet

ReferenceSet({"name": null, "sourceURI": "TODO", "assemblyId": "TODO", "sourceAccessions": [], "ncbiTaxonId": 9606, "isDerived": false, "id": "R1JDaDM4LXN1YnNldA==", "md5checksum": "234ea63052f999c21c2bdb6f60e61038", "description": "TODO"})
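
The ids in this output look like Base64 strings. As an observation about this particular server output only (the API does not promise it, and ids are best treated as opaque), decoding one yields a readable name. A minimal sketch using the standard library, where `decode_id` is a hypothetical helper name and not part of the ga4gh client:

```python
import base64


def decode_id(encoded_id):
    """Decode a Base64-encoded id string into readable text."""
    return base64.b64decode(encoded_id).decode("utf-8")


# ReferenceSet id copied from the output above
print(decode_id("R1JDaDM4LXN1YnNldA=="))
```

This is handy for debugging, but client code should keep passing the encoded id back to the server unchanged.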


## Search/Get References

In [4]:
results = client.searchReferences(referenceSetId)
for result in results:
    print result.toJsonDict()
    referenceId = result.toJsonDict()['id']

{u'name': u'1', u'sourceURI': u'http://www.ebi.ac.uk/ena/data/view/CM000663.2%26range=0-138395&display=fasta', u'sourceAccessions': [u'CM000663.2.subset'], u'sourceDivergence': None, u'length': 138395, u'ncbiTaxonId': 9606, u'isDerived': False, u'id': u'R1JDaDM4LXN1YnNldDox', u'md5checksum': u'bb07c91cda4645ad8e75e375e3d6e5eb'}
{u'name': u'2', u'sourceURI': u'http://www.ebi.ac.uk/ena/data/view/CM000664.2%26range=0-32403&display=fasta', u'sourceAccessions': [u'CM000664.2.subset'], u'sourceDivergence': None, u'length': 32403, u'ncbiTaxonId': 9606, u'isDerived': False, u'id': u'R1JDaDM4LXN1YnNldDoy', u'md5checksum': u'f513b7c19964e17092b55df262f62990'}
{u'name': u'3', u'sourceURI': u'http://www.ebi.ac.uk/ena/data/view/CM000665.2%26range=0-81617&display=fasta', u'sourceAccessions': [u'CM000665.2.subset'], u'sourceDivergence': None, u'length': 81617, u'ncbiTaxonId': 9606, u'isDerived': False, u'id': u'R1JDaDM4LXN1YnNldDoz', u'md5checksum': u'0680724fdd32e4dba0c14db5cef24f1e'}
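
The loop above keeps whichever id happens to be returned last. To pick a specific reference instead, the records can be filtered by name. A sketch under stated assumptions: the list below is a trimmed copy of the three records printed above, and `find_by_name` is a hypothetical helper, not a ga4gh client function.

```python
# Trimmed copies of the records printed above (only the fields used here)
references = [
    {"name": "1", "length": 138395, "id": "R1JDaDM4LXN1YnNldDox"},
    {"name": "2", "length": 32403, "id": "R1JDaDM4LXN1YnNldDoy"},
    {"name": "3", "length": 81617, "id": "R1JDaDM4LXN1YnNldDoz"},
]


def find_by_name(records, name):
    """Return the first record whose 'name' matches, or None if absent."""
    for record in records:
        if record["name"] == name:
            return record
    return None


chosen = find_by_name(references, "2")
print(chosen["id"])
```

With a live server, the same helper could be fed `[r.toJsonDict() for r in results]`.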


In [5]:
reference = client.getReference(referenceId)
print reference

Reference({"name": "3", "sourceURI": "http://www.ebi.ac.uk/ena/data/view/CM000665.2%26range=0-81617&display=fasta", "sourceAccessions": ["CM000665.2.subset"], "sourceDivergence": null, "length": 81617, "md5checksum": "0680724fdd32e4dba0c14db5cef24f1e", "isDerived": false, "id": "R1JDaDM4LXN1YnNldDoz", "ncbiTaxonId": 9606})


In addition to fetching metadata about the reference, you can request the base sequence:

In [6]:
client.listReferenceBases(referenceId, start=10000, end=10100)

u'CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCCTACCCTAACCCTAACCCTAACCCTCACCCTCACCCTCACTCAACC'
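
GA4GH coordinates are zero-based and half-open, so a request with start=10000 and end=10100 covers 100 bases and excludes position 10100 itself. The same convention can be sketched with plain string slicing; the toy sequence and the helper name `slice_bases` are made up for illustration.

```python
def slice_bases(sequence, start, end):
    """Return bases in the zero-based, half-open interval [start, end)."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("interval out of range")
    return sequence[start:end]


toy_sequence = "CTAACCCTAACC"  # made-up 12-base sequence, not server data
window = slice_bases(toy_sequence, 2, 8)
print("%s (%d bases)" % (window, len(window)))
```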

## Search/Get Datasets

In [7]:
results = client.searchDatasets()
for result in results:
    # Keep the id of the last Dataset returned for the calls below
    datasetId = result.toJsonDict()['id']

In [8]:
dataset = client.getDataset(datasetId)
print dataset

Dataset({"description": null, "name": "1kg-p3-subset", "id": "MWtnLXAzLXN1YnNldA=="})


## Search/Get VariantSets

In [9]:
results = client.searchVariantSets(datasetId)
for result in results:
    variantSetId = result.toJsonDict()['id']

In [10]:
variantSet = client.getVariantSet(variantSetId)
print variantSet

VariantSet({"name": "mvncall", "referenceSetId": "", "id": "MWtnLXAzLXN1YnNldDptdm5jYWxs", "datasetId": "MWtnLXAzLXN1YnNldA==", "metadata": [{"info": {}, "description": "", "number": "1", "value": "VCFv4.1", "key": "version", "type": "String", "id": ""}, {"info": {}, "description": "Confidence interval around END for imprecise variants", "number": "2", "value": "", "key": "INFO.CIEND", "type": "Integer", "id": ""}, {"info": {}, "description": "Confidence interval around POS for imprecise variants", "number": "2", "value": "", "key": "INFO.CIPOS", "type": "Integer", "id": ""}, {"info": {}, "description": "Source call set.", "number": "1", "value": "", "key": "INFO.CS", "type": "String", "id": ""}, {"info": {}, "description": "End coordinate of this variant", "number": "1", "value": "", "key": "INFO.END", "type": "Integer", "id": ""}, {"info": {}, "description": "Imprecise structural variation", "number": "0", "value": "", "key": "INFO.IMPRECISE", "type": "Flag", "id": ""}, {"info": {}, 
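
The metadata field is a long list of entries, which is awkward to scan by eye. A minimal sketch that indexes the entries by their key, assuming the VariantSet has already been converted to plain Python data (for example with toJsonDict(), as in the search cells). The two entries are copied from the output above, and `index_metadata` is a hypothetical helper name.

```python
# Two metadata entries copied from the VariantSet output above
metadata = [
    {"info": {}, "description": "", "number": "1", "value": "VCFv4.1",
     "key": "version", "type": "String", "id": ""},
    {"info": {}, "description": "End coordinate of this variant",
     "number": "1", "value": "", "key": "INFO.END", "type": "Integer",
     "id": ""},
]


def index_metadata(entries):
    """Map each metadata key to its entry for direct lookup."""
    return dict((entry["key"], entry) for entry in entries)


by_key = index_metadata(metadata)
print(by_key["version"]["value"])
print(by_key["INFO.END"]["description"])
```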

## Search/Get CallSets

Each CallSet holds the variant calls for one sample in a VariantSet.

In [11]:
results = client.searchCallSets(variantSetId)
for result in results:
    print result.toJsonDict()['name']
    # Keep the id of the last CallSet returned for the calls below
    callSetId = result.toJsonDict()['id']

HG00096
HG00533
HG00534


In [12]:
callset = client.getCallSet(callSetId)
print callset

CallSet({"info": {}, "updated": 1444663564000, "name": "HG00534", "created": 1444663564000, "sampleId": "HG00534", "variantSetIds": ["MWtnLXAzLXN1YnNldDptdm5jYWxs"], "id": "MWtnLXAzLXN1YnNldDptdm5jYWxsOkhHMDA1MzQ="})
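
The created and updated fields are timestamps in milliseconds since the Unix epoch. A small sketch for turning them into a readable UTC date, using only the standard library; `millis_to_utc` is a hypothetical helper name.

```python
import datetime


def millis_to_utc(millis):
    """Convert milliseconds since the Unix epoch to a naive UTC datetime."""
    epoch = datetime.datetime(1970, 1, 1)
    return epoch + datetime.timedelta(milliseconds=millis)


created = millis_to_utc(1444663564000)  # value copied from the output above
print(created.isoformat())
```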


## Search/Get Variants

In [13]:
results = client.searchVariantSets(datasetId)
for result in results:
    variantSetId = result.toJsonDict()['id']

In [14]:
variantset = client.getVariantSet(variantSetId)

In [15]:
print variantset

VariantSet({"name": "mvncall", "referenceSetId": "", "id": "MWtnLXAzLXN1YnNldDptdm5jYWxs", "datasetId": "MWtnLXAzLXN1YnNldA==", "metadata": [{"info": {}, "description": "", "number": "1", "value": "VCFv4.1", "key": "version", "type": "String", "id": ""}, {"info": {}, "description": "Confidence interval around END for imprecise variants", "number": "2", "value": "", "key": "INFO.CIEND", "type": "Integer", "id": ""}, {"info": {}, "description": "Confidence interval around POS for imprecise variants", "number": "2", "value": "", "key": "INFO.CIPOS", "type": "Integer", "id": ""}, {"info": {}, "description": "Source call set.", "number": "1", "value": "", "key": "INFO.CS", "type": "String", "id": ""}, {"info": {}, "description": "End coordinate of this variant", "number": "1", "value": "", "key": "INFO.END", "type": "Integer", "id": ""}, {"info": {}, "description": "Imprecise structural variation", "number": "0", "value": "", "key": "INFO.IMPRECISE", "type": "Flag", "id": ""}, {"info": {}, 
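
The cells in this section fetch the VariantSet again but do not search for variants themselves. A hedged sketch of what such a search might look like follows. It assumes the installed ga4gh client exposes `searchVariants(variantSetId, start=..., end=..., referenceName=...)` in the same style as the other search calls in this notebook, and that each result has `toJsonDict()` with the standard Variant fields; verify both against your client version. `summarize_variants` is a hypothetical helper name.

```python
def summarize_variants(client, variantSetId, referenceName, start, end):
    """Collect (start, end, ref, alts) tuples for variants in a region.

    ASSUMPTION: client.searchVariants(variantSetId, start=..., end=...,
    referenceName=...) exists in the installed ga4gh client and yields
    objects with toJsonDict(), like the other search calls shown here.
    """
    summaries = []
    for variant in client.searchVariants(
            variantSetId, start=start, end=end,
            referenceName=referenceName):
        record = variant.toJsonDict()
        summaries.append((
            record['start'], record['end'],
            record['referenceBases'], record['alternateBases']))
    return summaries


# Usage against a running server (the region values are placeholders):
# for summary in summarize_variants(client, variantSetId, "1", 0, 100000):
#     print(summary)
```

Taking the client as a parameter keeps the helper independent of any particular server, so it can be exercised with a stand-in object before pointing it at real data.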

## Search/Get ReadGroupSets

In [16]:
results = client.searchReadGroupSets(datasetId)
for result in results:
    results_json = result.toJsonDict()
    print results_json['name'], results_json['id']
    readGroupSetId = results_json['id']
    readGroupId = results_json['readGroups'][0]['id']

HG00096 MWtnLXAzLXN1YnNldDpIRzAwMDk2
HG00533 MWtnLXAzLXN1YnNldDpIRzAwNTMz
HG00534 MWtnLXAzLXN1YnNldDpIRzAwNTM0


In [17]:
readgroupset = client.getReadGroupSet(readGroupSetId)
print readgroupset

ReadGroupSet({"readGroups": [{"info": {}, "updated": 1444665650769, "predictedInsertSize": 486, "stats": {"unalignedReadCount": -1, "alignedReadCount": -1, "baseCount": null}, "description": "SRP001293", "created": 1444665650769, "programs": [{"commandLine": "bwa index -a bwtsw $reference_fasta", "prevProgramId": null, "id": "bwa_index", "version": "0.5.9-r16", "name": "bwa"}, {"commandLine": "bwa aln -q 15 -f $sai_file $reference_fasta $fastq_file\tPP:bwa_index", "prevProgramId": null, "id": "bwa_aln_fastq", "version": "0.5.9-r16", "name": "bwa"}, {"commandLine": "bwa sampe -a 1458 -r $rg_line -f $sam_file $reference_fasta $sai_file(s) $fastq_file(s)\tPP:bwa_aln_fastq", "prevProgramId": null, "id": "bwa_sam", "version": "0.5.9-r16", "name": "bwa"}, {"commandLine": "samtools view -bSu $sam_file | samtools sort -n -o - samtools_nsort_tmp | samtools fixmate /dev/stdin /dev/stdout | samtools sort -o - samtools_csort_tmp | samtools fillmd -u - $reference_fasta > $fixed_bam_file\tPP:bwa_sam

## Get ReadGroups

In [18]:
readgroup = client.getReadGroup(readGroupId)
print readgroup

ReadGroup({"info": {}, "updated": 1444665650769, "predictedInsertSize": 486, "stats": {"unalignedReadCount": -1, "alignedReadCount": -1, "baseCount": null}, "description": "SRP001293", "created": 1444665650769, "programs": [{"commandLine": "bwa index -a bwtsw $reference_fasta", "prevProgramId": null, "id": "bwa_index", "version": "0.5.9-r16", "name": "bwa"}, {"commandLine": "bwa aln -q 15 -f $sai_file $reference_fasta $fastq_file\tPP:bwa_index", "prevProgramId": null, "id": "bwa_aln_fastq", "version": "0.5.9-r16", "name": "bwa"}, {"commandLine": "bwa sampe -a 1458 -r $rg_line -f $sam_file $reference_fasta $sai_file(s) $fastq_file(s)\tPP:bwa_aln_fastq", "prevProgramId": null, "id": "bwa_sam", "version": "0.5.9-r16", "name": "bwa"}, {"commandLine": "samtools view -bSu $sam_file | samtools sort -n -o - samtools_nsort_tmp | samtools fixmate /dev/stdin /dev/stdout | samtools sort -o - samtools_csort_tmp | samtools fillmd -u - $reference_fasta > $fixed_bam_file\tPP:bwa_sam", "prevProgramId":

## Get Reads

In [19]:
results = client.searchReads([readGroupId], referenceId)
# Count reads explicitly: the last enumerate() index is one less than the
# total, and it is undefined when the search returns nothing
readCount = 0
for result in results:
    readId = result.toJsonDict()['id']
    readCount += 1
print "%d reads in readgroup id = %s on chrom id = %s" % (readCount, readGroupId, referenceId)

998 reads in readgroup id = MWtnLXAzLXN1YnNldDpIRzAwNTM0OkVSUjAyMDIzOA== on chrom id = R1JDaDM4LXN1YnNldDoz


In [20]:
print result

ReadAlignment({"info": {"MD": ["15A75"], "NM": ["1"], "AM": ["37"], "RG": ["ERR020238"], "MQ": ["60"], "BQ": ["@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@"], "SM": ["37"], "X0": ["1"], "X1": ["0"], "XT": ["U"]}, "duplicateFragment": false, "readGroupId": "MWtnLXAzLXN1YnNldDpIRzAwNTM0OkVSUjAyMDIzOA==", "alignedQuality": [35, 37, 40, 39, 39, 41, 41, 39, 42, 43, 42, 42, 44, 43, 41, 44, 40, 42, 37, 43, 42, 40, 41, 41, 41, 43, 41, 42, 38, 43, 41, 41, 44, 44, 42, 44, 39, 42, 42, 44, 43, 43, 42, 45, 42, 43, 43, 42, 42, 43, 42, 42, 39, 44, 44, 44, 40, 44, 43, 42, 42, 43, 42, 44, 43, 42, 43, 44, 41, 41, 43, 42, 41, 41, 41, 42, 41, 40, 41, 42, 41, 40, 40, 40, 39, 41, 38, 38, 37, 37, 35], "failedVendorQualityChecks": false, "fragmentName": "ERR020238.14295398", "readNumber": 0, "properPlacement": true, "fragmentId": "TODO", "supplementaryAlignment": false, "numberReads": 2, "fragmentLength": -477, "secondaryAlignment": false, "alignedSequence": "CAA
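
The alignedQuality list holds one integer per base. Assuming these are Phred-scaled scores, as in the SAM QUAL field, each score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch using the first five values from the output above; `phred_to_error_probability` is a hypothetical helper name.

```python
def phred_to_error_probability(quality):
    """Convert a Phred-scaled score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-quality / 10.0)


# First five alignedQuality values copied from the ReadAlignment above
aligned_quality = [35, 37, 40, 39, 39]
mean_quality = sum(aligned_quality) / float(len(aligned_quality))
print("mean quality: %.1f" % mean_quality)
for q in aligned_quality:
    print("Q%d -> %.6f" % (q, phred_to_error_probability(q)))
```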