# Biomedical Data Bases, 2020-2021
### Programmatic Access to Databases
These are notes by Prof. Davide Salomoni (d.salomoni@unibo.it) for the Biomedical Data Base course at the University of Bologna, academic year 2020-2021.

## Running an external script from Python

Here we demonstrate the use of the _subprocess_ module.

In [1]:
# An example of running an external program (here: 'ls -l') from Python
import subprocess

sp = subprocess.run('ls -l', shell=True, capture_output=True, text=True)
print(sp.stdout)

total 12668
drwxr-xr-x 15 jovyan users     480 Dec 22 12:46 BDB_2021
-rw-r--r--  1 jovyan users   18002 Jan  3 10:08 COVID-19-italy-only.xlsx
-rw-r--r--  1 jovyan users 4044526 Dec 22 17:01 COVID-19-sample-BDB2022.csv
-rw-r--r--  1 jovyan users 5570759 Dec 22 13:53 COVID-19-sample-BDB2022.xlsx
-rw-r--r--  1 jovyan users     484 Jan  4 17:40 dump.rdb
-rw-r--r--  1 jovyan users  186278 Aug  7 13:02 Esempi di web scraping.ipynb
-rw-r--r--  1 jovyan users   83107 Aug 10 12:30 Esempi di web scraping - solo codice.ipynb
-rw-r--r--  1 jovyan users  205261 Dec 27 10:13 excel-analysis.ipynb
-rw-r--r--  1 jovyan users  884736 Jan  4 12:07 gubbio_env_2018_custom.sqlite
-rw-r--r--  1 jovyan users  442368 Jan  4 10:15 gubbio_env_2018.sqlite
-rw-r--r--  1 jovyan users  110296 Jan  3 10:09 pandas-examples.ipynb
-rw-r--r--  1 jovyan users  228250 Jan  6 12:28 programmatic_access.ipynb
-rw-r--r--  1 jovyan users   61232 Jan  5 10:02 RDBMS_sample.png
-rw-r--r--  1 jovyan users   19202 Jan  5 10:08 redis
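
The call above hands the whole command to the shell as a single string. As a minimal sketch of an alternative (a style choice assumed here, not taken from the course material), the command can be passed as a list of arguments without `shell=True`, and `returncode` can be checked before the output is used. The child command is an illustrative stand-in that only prints one line.

```python
import subprocess
import sys

# Run a child Python interpreter. sys.executable is the path of the
# interpreter running this code, so the example does not depend on 'ls'.
sp = subprocess.run(
    [sys.executable, '-c', 'print("hello from the child process")'],
    capture_output=True,
    text=True,
)

# By convention a return code of 0 means the program succeeded.
if sp.returncode == 0:
    print(sp.stdout.strip())
else:
    print('The command failed:', sp.stderr.strip())
```

Passing a list avoids shell quoting issues when arguments contain spaces or special characters.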

### Running an external program and processing its output

Suppose we have a Python program implementing an iterative method to compute the square root of a number $S$, according to the following algorithm (called _Heron's method_):

$$
x_0 = 1
\\
x_{n+1} = \frac{1}{2} (x_n + \frac{S}{x_n})
\\
\sqrt{S} = \lim_{n \to \infty} x_n
$$
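
The recurrence above can be sketched in a few lines of Python. The function name `heron_sqrt` is hypothetical and this is not the course's _sqrt_iterative.py_. It only illustrates the update rule, assuming $S > 0$.

```python
def heron_sqrt(S, iterations):
    """Approximate sqrt(S) with Heron's method, starting from x_0 = 1."""
    x = 1.0
    for _ in range(iterations):
        # x_{n+1} = (x_n + S / x_n) / 2
        x = 0.5 * (x + S / x)
    return x


# Show how the estimate improves as the number of iterations grows.
for n in (3, 6, 9, 12):
    print(n, heron_sqrt(23941, n))
```

Because the starting guess is fixed at 1, large inputs need more iterations before the estimate settles.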

If the program is called _sqrt_iterative.py_ (find it in the BDB GitHub repo, or write one yourself) and expects as input the number whose square root we want to compute and the number of iterations to perform, we can call it and check its output from another Python program like this:

In [2]:
import subprocess
import math

number = 23941
iterations = 9

sp = subprocess.run('python3 sqrt_iterative.py %s %s' % (number, iterations), shell=True, capture_output=True, text=True)
result = float(sp.stdout.strip()) # what happens if the result cannot be converted to a float?

print('After %d iterations, the square root of %d is %f' % (iterations, number, result))
print('The difference with math.sqrt(%d) is %f' % (number, result-math.sqrt(number)))

After 9 iterations, the square root of 23941 is 155.142763
The difference with math.sqrt(23941) is 0.413968
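
The comment in the cell above asks what happens if the output cannot be converted to a float: `float()` raises a `ValueError`. A minimal sketch of one way to handle that case follows. The helper name `parse_float_output` is hypothetical, and the sample strings are made up for illustration.

```python
def parse_float_output(text):
    """Return the float contained in text, or None if it is not a number."""
    try:
        return float(text.strip())
    except ValueError:
        return None


# A well-formed result, an error message, and an empty output.
print(parse_float_output('155.142763\n'))
print(parse_float_output('usage: sqrt_iterative.py NUMBER ITERATIONS'))
print(parse_float_output(''))
```

In the cell above, `sp.stdout` would be passed to such a helper, ideally after checking that `sp.returncode` is 0.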


## Using the _requests_ module

In [3]:
# test the requests module by querying Google
import requests
res = requests.get('http://www.google.com')
print(res)

<Response [200]>


In [4]:
print(res.status_code)

200


In [5]:
# if the requests call succeeded, print the text that was returned.
res = requests.get('http://www.google.com')
if res.status_code == 200:
    print(res.text)

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="it"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="VzZE3HxvmIY8ZNqaZZk+tw==">(function(){window.google={kEI:'cw7XYdnAI7TA5OUP_ai08AI',kEXPI:'0,1302536,56873,6058,207,4804,2316,383,246,5,1354,4013,1237,1122516,1197698,703,380089,16115,28684,17572,4858,1362,284,9007,3026,17582,4020,978,13228,3847,4192,6430,21823,919,5080,1593,1279,2213,529,149,1103,840,1983,213,4101,109,3405,606,2023,1777,520,6342,8328,3227,2845,7,12354,5096,7540,4084,4697,907,2,941,2614,3783,9359,3,576,6460,148,13975,4,1528,656,1648,6462,577,4683,8588,7038,2726,2038,2658,6701,656,30,13628,2305,18918,652,1869,3279,2557,4094,17,3123,4,908,3,3541,1,16524,283,38,874,5992,18443,2,3036,10986,1931,5589,744,5852,9321,1142,1160,1267,4412,1021,2377,2722,18260,2,1,13,5506,2240,2125,2443,2578,3678,2957,

### Remember to check the status code

In [6]:
r = requests.get('https://github.com/timelines.json')
print(r.status_code)

410


In [7]:
print(r.text)

{"message":"Hello there, wayfaring stranger. If you’re reading this then you probably didn’t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.","documentation_url":"https://docs.github.com/v3/activity/events/#list-public-events-performed-by-a-user"}
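
The 410 above means "Gone": the request reached the server, but the resource is no longer available. As a small sketch using only the standard library, status codes can be turned into readable summaries and classified as success or failure. The helper names `status_summary` and `is_success` are hypothetical.

```python
from http import HTTPStatus


def status_summary(status_code):
    """Return a short human-readable summary such as '410 Gone'."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # The code is not one of the standard HTTP status codes.
        phrase = 'Unknown status'
    return '%d %s' % (status_code, phrase)


def is_success(status_code):
    """True for 2xx codes, which indicate that the request succeeded."""
    return 200 <= status_code < 300


for code in (200, 410):
    outcome = 'success' if is_success(code) else 'failure'
    print(status_summary(code), '->', outcome)
```

With _requests_, the response object also offers `res.ok` and `res.raise_for_status()`, which cover the same check without a manual comparison.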


### Query PDB using REST calls

Note that when the output is returned in JSON format we can easily parse it using normal Python dictionaries and lists.

In [8]:
# query the PDB using the REST API. It returns JSON output.
r = requests.get('https://data.rcsb.org/rest/v1/core/entry/4GYD')

# convert the json return value to a Python dictionary
data = r.json()

# check it is indeed a dictionary
type(data)

dict

In [9]:
# since 'data' is a dictionary, check what its keys are:
data.keys()

dict_keys(['audit_author', 'cell', 'citation', 'diffrn', 'diffrn_detector', 'diffrn_radiation', 'diffrn_source', 'entry', 'exptl', 'exptl_crystal', 'exptl_crystal_grow', 'pdbx_audit_revision_category', 'pdbx_audit_revision_details', 'pdbx_audit_revision_group', 'pdbx_audit_revision_history', 'pdbx_audit_revision_item', 'pdbx_database_related', 'pdbx_database_status', 'pdbx_vrpt_summary', 'rcsb_accession_info', 'rcsb_entry_container_identifiers', 'rcsb_entry_info', 'rcsb_primary_citation', 'refine', 'refine_hist', 'refine_ls_restr', 'reflns', 'reflns_shell', 'software', 'struct', 'struct_keywords', 'symmetry', 'rcsb_id'])

In [10]:
# get info from the 'cell' key:
data['cell']

{'angle_alpha': 90.0,
 'angle_beta': 90.0,
 'angle_gamma': 90.0,
 'length_a': 77.721,
 'length_b': 79.803,
 'length_c': 80.154,
 'zpdb': 24}
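
To make the parsing step explicit without a network connection, the sketch below applies `json.loads` to a small JSON string. The string is a hand-made fragment that reuses a few of the values shown above for entry 4GYD. It is not the full API response.

```python
import json

# A hand-made fragment of JSON text (values copied from the output above).
text = '''{
    "cell": {"angle_alpha": 90.0, "length_a": 77.721,
             "length_b": 79.803, "length_c": 80.154, "zpdb": 24},
    "rcsb_id": "4GYD"
}'''

data = json.loads(text)          # JSON object -> Python dict
print(type(data))
print(data['cell']['length_a'])  # nested objects become nested dicts

# .get() returns a default instead of raising KeyError when a key is absent.
print(data.get('citation', 'no citation in this fragment'))

# Collect several values from the nested dictionary into a list.
lengths = [data['cell'][key] for key in ('length_a', 'length_b', 'length_c')]
print(lengths)
```

`r.json()` performs essentially the same conversion on the body of the response.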

In [11]:
# get info for polymer entity data, providing PDB ID and polymer ID:
r = requests.get('https://data.rcsb.org/rest/v1/core/polymer_entity/4GYD/1')
data = r.json()
data.keys()

dict_keys(['rcsb_cluster_membership', 'entity_poly', 'entity_src_gen', 'rcsb_entity_host_organism', 'rcsb_entity_source_organism', 'rcsb_polymer_entity', 'rcsb_polymer_entity_align', 'rcsb_polymer_entity_annotation', 'rcsb_polymer_entity_container_identifiers', 'rcsb_polymer_entity_feature', 'rcsb_polymer_entity_feature_summary', 'rcsb_polymer_entity_name_com', 'rcsb_id', 'rcsb_cluster_flexibility', 'rcsb_latest_revision'])

In [12]:
# see what's inside the 'entity_poly' key:
data['entity_poly']

{'nstd_linkage': 'no',
 'nstd_monomer': 'no',
 'pdbx_seq_one_letter_code': 'ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNAMPAFKGRLKPEQIEDVAAYVLGKADADWK',
 'pdbx_seq_one_letter_code_can': 'ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNAMPAFKGRLKPEQIEDVAAYVLGKADADWK',
 'pdbx_strand_id': 'A,B,C,D,E,F',
 'rcsb_artifact_monomer_count': 0,
 'rcsb_conflict_count': 0,
 'rcsb_deletion_count': 0,
 'rcsb_entity_polymer_type': 'Protein',
 'rcsb_insertion_count': 0,
 'rcsb_mutation_count': 0,
 'rcsb_non_std_monomer_count': 0,
 'rcsb_sample_sequence_length': 86,
 'type': 'polypeptide(L)'}

In [13]:
# get the PubMed annotations for the entry
r2 = requests.get('https://data.rcsb.org/rest/v1/core/pubmed/4GYD')
data2 = r2.json()
data2.keys()

dict_keys(['rcsb_id', 'rcsb_pubmed_container_identifiers', 'rcsb_pubmed_doi', 'rcsb_pubmed_abstract_text', 'rcsb_pubmed_affiliation_info', 'rcsb_pubmed_mesh_descriptors', 'rcsb_pubmed_mesh_descriptors_lineage'])

In [14]:
# this is the pubmed abstract:
data2['rcsb_pubmed_abstract_text']

'The rapid transfer of electrons in the photosynthetic redox chain is achieved by the formation of short-lived complexes of cytochrome b6f with the electron transfer proteins plastocyanin and cytochrome c6. A balance must exist between fast intermolecular electron transfer and rapid dissociation, which requires the formation of a complex that has limited specificity. The interaction of the soluble fragment of cytochrome f and cytochrome c6 from the cyanobacterium Nostoc sp. PCC 7119 was studied using NMR spectroscopy and X-ray diffraction. The crystal structures of wild type, M58H and M58C cytochrome c6 were determined. The M58C variant is an excellent low potential mimic of the wild type protein and was used in chemical shift perturbation and paramagnetic relaxation NMR experiments to characterize the complex with cytochrome f. The interaction is highly dynamic and can be described as a pure encounter complex, with no dominant stereospecific complex. Ensemble docking calculations and 

### Chemical component

In [15]:
r = requests.get('https://data.rcsb.org/rest/v1/core/chemcomp/CFF')
data = r.json()

In [16]:
data.keys()

dict_keys(['chem_comp', 'pdbx_chem_comp_audit', 'pdbx_chem_comp_descriptor', 'pdbx_chem_comp_identifier', 'rcsb_chem_comp_annotation', 'rcsb_chem_comp_container_identifiers', 'rcsb_chem_comp_descriptor', 'rcsb_chem_comp_info', 'rcsb_chem_comp_related', 'rcsb_chem_comp_synonyms', 'rcsb_chem_comp_target', 'rcsb_id'])

In [17]:
data['chem_comp']

{'formula': 'C8 H10 N4 O2',
 'formula_weight': 194.191,
 'id': 'CFF',
 'name': 'CAFFEINE',
 'pdbx_ambiguous_flag': 'N',
 'pdbx_formal_charge': 0,
 'pdbx_initial_date': '2000-05-16T00:00:00+0000',
 'pdbx_modified_date': '2020-06-17T00:00:00+0000',
 'pdbx_processing_site': 'RCSB',
 'pdbx_release_status': 'REL',
 'three_letter_code': 'CFF',
 'type': 'non-polymer'}

### Drug Bank

In [18]:
r = requests.get('https://data.rcsb.org/rest/v1/core/drugbank/CFF')
data = r.json()
data.keys()

dict_keys(['drugbank_container_identifiers', 'drugbank_info', 'drugbank_target'])

In [19]:
data['drugbank_info'].keys()

dict_keys(['atc_codes', 'brand_names', 'cas_number', 'description', 'drug_categories', 'drug_groups', 'drugbank_id', 'indication', 'name', 'synonyms'])

In [20]:
data['drugbank_info']['description']

'Caffeine is a drug of the methylxanthine class used for a variety of purposes, including certain respiratory conditions of the premature newborn, pain relief, and to combat drowsiness. Caffeine is similar in chemical structure to [Theophylline] and [Theobromine].[A187691,L9851] It can be sourced from coffee beans, but also occurs naturally in various teas and cacao beans, which are different than coffee beans.[T716] Caffeine is also used in a variety of cosmetic products and can be administered topically, orally, by inhalation, or by injection.[T716]  The caffeine citrate injection, used for apnea of the premature newborn, was initially approved by the FDA in 1999.[L9863] According to an article from 2017, more than 15 million babies are born prematurely worldwide. This correlates to about 1 in 10 births. Premature birth can lead to apnea and bronchopulmonary dysplasia, a condition that interferes with lung development and may eventually cause asthma or early onset emphysema in those 

In [21]:
data['drugbank_info']['indication']

'Caffeine is indicated for the short term treatment of apnea of prematurity in infants and off label for the prevention and treatment of bronchopulmonary dysplasia caused by premature birth.[T716,L9851] In addition, it is indicated in combination with sodium benzoate to treat respiratory depression resulting from an overdose with CNS depressant drugs.[L9899] Caffeine has a broad range of over the counter uses, and is found in energy supplements, athletic enhancement products, pain relief products, as well as cosmetic products.[T716,L9854,L9872]'

### Processing multiple entries

In [22]:
protein_ids = ['4GYD', '4H0J', '4H0K']

protein_dict = dict()
for protein in protein_ids:
    r = requests.get('https://data.rcsb.org/rest/v1/core/entry/%s' % protein)
    data = r.json()
    protein_dict[protein] = data['cell']

# now print e.g. length_a, length_b and length_c for all the proteins:
for (key,value) in protein_dict.items():
    print('Protein %s: a=%f, b=%f, c=%f' % (key, value['length_a'], value['length_b'], value['length_c']))

Protein 4GYD: a=77.721000, b=79.803000, c=80.154000
Protein 4H0J: a=78.823000, b=80.157000, c=80.147000
Protein 4H0K: a=60.370000, b=60.370000, c=95.370000


### Getting sequence data in FASTA format

Note that in this case the output is returned as regular text, i.e. it is not JSON-formatted.

In [23]:
# print FASTA data for some proteins
protein_ids = ['4GYD', '4H0J', '4H0K']

for protein in protein_ids:
    r = requests.get('https://www.rcsb.org/fasta/entry/%s/download' % protein)
    print(r.text)

>4GYD_1|Chains A, B, C, D, E, F|Cytochrome c6|Nostoc (103690)
ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNAMPAFKGRLKPEQIEDVAAYVLGKADADWK

>4H0J_1|Chains A, B, C, D, E, F|Cytochrome c6|Nostoc (103690)
ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNACPAFKGRLKPEQIEDVAAYVLGKADADWK

>4H0K_1|Chains A, B|Cytochrome c6|Nostoc (103690)
ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNAHPAFKGRLKPEQIEDVAAYVLGKADADWK
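
Since FASTA is plain text, it has to be parsed by hand (or with a dedicated library). The sketch below is a minimal parser plus a position-by-position comparison, applied to two of the records printed above. The function names `parse_fasta` and `differences` are hypothetical, and the comparison assumes sequences of equal length.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records


def differences(seq_a, seq_b):
    """Return (position, residue_a, residue_b) for each mismatch (1-based)."""
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
            if a != b]


fasta_text = """>4GYD_1|Chains A, B, C, D, E, F|Cytochrome c6|Nostoc (103690)
ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNAMPAFKGRLKPEQIEDVAAYVLGKADADWK

>4H0J_1|Chains A, B, C, D, E, F|Cytochrome c6|Nostoc (103690)
ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNACPAFKGRLKPEQIEDVAAYVLGKADADWK
"""

records = parse_fasta(fasta_text)
for header, sequence in records:
    print(header.split('|')[0], len(sequence))
print(differences(records[0][1], records[1][1]))
```

The single mismatch found this way corresponds to one of the cytochrome c6 variants mentioned in the PubMed abstract for 4GYD.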



## The PDB Search API

In [24]:
# a BLAST-like example using the PDB search API
fasta = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS"
my_query = '''{
    "query": {
        "type" : "terminal",
        "service" : "sequence",
        "parameters" : {
            "evalue_cutoff" : 1,
            "identity_cutoff" : 0.9,
            "target" : "pdb_protein_sequence",
            "value" : "%s"
        }
    },
    "request_options" : {
        "scoring_strategy" : "sequence"
    },
    "return_type" : "polymer_entity"
}''' % fasta
r = requests.get('http://search.rcsb.org/rcsbsearch/v1/query?json=%s' % requests.utils.requote_uri(my_query))
j = r.json()

In [25]:
# these are keys of the dictionary:
print(j.keys())

dict_keys(['query_id', 'result_type', 'total_count', 'result_set'])


In [26]:
# let's print the results:
print("We got %s matches" % j['total_count'])
print("The first %s results follow:" % len(j['result_set']))
for item in j['result_set']:
    print(item['identifier'], "score =", item['score'])

We got 438 matches
The first 10 results follow:
4Q21_1 score = 1.0
5X9S_1 score = 0.647887323943662
6Q21_1 score = 0.4225352112676056
1IOZ_1 score = 0.4225352112676056
1AA9_1 score = 0.4225352112676056
1Q21_1 score = 0.4225352112676056
2Q21_1 score = 0.36619718309859156
6AMB_1 score = 0.3380281690140845
6KYH_2 score = 0.30985915492957744
1LFD_2 score = 0.29577464788732394


In [27]:
# a sequence motif search
# we use here the Zinc finger Cys2His2-like fold group
# its PROSITE signature is available at https://prosite.expasy.org/PS00028
my_query = '''
{
  "query": {
    "type": "terminal",
    "service": "seqmotif",
    "parameters": {
      "value": "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H",
      "pattern_type": "prosite",
      "target": "pdb_protein_sequence"
    }
  },
  "return_type": "polymer_entity"
}
'''
r = requests.get('http://search.rcsb.org/rcsbsearch/v1/query?json=%s' % requests.utils.requote_uri(my_query))
j = r.json()

In [28]:
j.keys()

dict_keys(['query_id', 'result_type', 'total_count', 'result_set'])

In [29]:
print("There are %s results in total, we got back details for the first %s" % 
      (j['total_count'], len(j['result_set'])))

There are 538 results in total, we got back details for the first 10


In [30]:
print('This is the detailed info for the first result:')
print(j['result_set'][0])
print('\nThe identifiers for the returned results are:')
for item in j['result_set']:
    print(item['identifier'])

This is the detailed info for the first result:
{'identifier': '7NVW_7', 'score': 1.0, 'services': [{'service_type': 'seqmotif', 'nodes': [{'node_id': 30099, 'original_score': 1.0, 'norm_score': 1.0, 'match_context': [{'start': 360, 'end': 380}]}]}]}

The identifiers for the returned results are:
7NVW_7
1YUJ_3
2DLK_1
5US3_1
1UN6_1
5K5H_1
7CUY_1
1VA1_1
5YEG_1
6AHD_26


## GraphQL 

In [31]:
# a GraphQL query
my_query = '''
{
    entry(entry_id: "4GYD") {
        cell {
            Z_PDB
            angle_alpha
            angle_beta
            angle_gamma
            formula_units_Z
            length_a
            length_b
            length_c
            pdbx_unique_axis
            volume
        }
    }
}
'''

r = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(my_query))
j = r.json()

In [32]:
# check the keys of the dictionary:
j.keys()

dict_keys(['data'])

In [33]:
# explore what is in j['data']:
j['data']

{'entry': {'cell': {'Z_PDB': 24,
   'angle_alpha': 90.0,
   'angle_beta': 90.0,
   'angle_gamma': 90.0,
   'formula_units_Z': None,
   'length_a': 77.721,
   'length_b': 79.803,
   'length_c': 80.154,
   'pdbx_unique_axis': None,
   'volume': None}}}

In [34]:
# print results with some formatting:
params = j['data']['entry']['cell']
for key,value in params.items():
    print(key, ':', value)

Z_PDB : 24
angle_alpha : 90.0
angle_beta : 90.0
angle_gamma : 90.0
formula_units_Z : None
length_a : 77.721
length_b : 79.803
length_c : 80.154
pdbx_unique_axis : None
volume : None
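
A GraphQL query lists exactly the fields to return, so it can be generated from a list of field names. The sketch below builds a query of the same shape as the one in In [31] and then drops the fields that came back as `None`. Both helper names are hypothetical, and the `cell` dictionary is copied from the output above rather than fetched.

```python
def build_entry_cell_query(entry_id, fields):
    """Build a GraphQL query asking for the given 'cell' fields of an entry."""
    lines = ['{', '    entry(entry_id: "%s") {' % entry_id, '        cell {']
    lines += ['            %s' % field for field in fields]
    lines += ['        }', '    }', '}']
    return '\n'.join(lines)


def drop_missing(mapping):
    """Return a copy of mapping without the keys whose value is None."""
    return {key: value for key, value in mapping.items() if value is not None}


print(build_entry_cell_query('4GYD', ['length_a', 'length_b', 'length_c']))

# The 'cell' dictionary as returned in the output above.
cell = {'Z_PDB': 24, 'angle_alpha': 90.0, 'angle_beta': 90.0,
        'angle_gamma': 90.0, 'formula_units_Z': None, 'length_a': 77.721,
        'length_b': 79.803, 'length_c': 80.154, 'pdbx_unique_axis': None,
        'volume': None}
for key, value in drop_missing(cell).items():
    print(key, ':', value)
```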


## UniProt

In [35]:
# get data from UniProt using the Proteins API
# note that the API returns a list, not a dictionary
requestURL = "https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession=P0A3X7&reviewed=true"

# note that we must specify that we want JSON output using the 'headers' parameter to requests.get()
r = requests.get(requestURL, headers={"Accept" : "application/json"})
j = r.json()
type(j)

list

In [36]:
# the returned list holds the entries we asked for
# note that there is only one entry:
len(j)

1

In [37]:
# this one entry is actually a dictionary:
j[0]

{'accession': 'P0A3X7',
 'id': 'CYC6_NOSS1',
 'proteinExistence': 'Evidence at protein level',
 'info': {'type': 'Swiss-Prot',
  'created': '2005-03-15',
  'modified': '2021-09-29',
  'version': 102},
 'organism': {'taxonomy': 103690,
  'names': [{'type': 'scientific',
    'value': 'Nostoc sp. (strain PCC 7120 / SAG 25.82 / UTEX 2576)'}],
  'lineage': ['Bacteria',
   'Cyanobacteria',
   'Nostocales',
   'Nostocaceae',
   'Nostoc']},
 'secondaryAccession': ['P28596'],
 'protein': {'recommendedName': {'fullName': {'value': 'Cytochrome c6'}},
  'alternativeName': [{'fullName': {'value': 'Cytochrome c-553'}},
   {'fullName': {'value': 'Cytochrome c553'}},
   {'fullName': {'value': 'Soluble cytochrome f'}}]},
 'gene': [{'name': {'value': 'petJ'},
   'synonyms': [{'value': 'cytA'}],
   'olnNames': [{'value': 'alr4251'}]}],
 'comments': [{'type': 'FUNCTION',
   'text': [{'value': 'Functions as an electron carrier between membrane-bound cytochrome b6-f and photosystem I in oxygenic photosynthe

In [38]:
# check the keys of the dictionary j[0]
j[0].keys()

dict_keys(['accession', 'id', 'proteinExistence', 'info', 'organism', 'secondaryAccession', 'protein', 'gene', 'comments', 'features', 'dbReferences', 'keywords', 'references', 'sequence'])

In [39]:
# these are the entries in the key 'dbReferences'
j[0]['dbReferences']

[{'type': 'EMBL',
  'id': 'M97009',
  'properties': {'molecule type': 'Genomic_DNA',
   'protein sequence ID': 'AAA59365.1'}},
 {'type': 'EMBL',
  'id': 'BA000019',
  'properties': {'molecule type': 'Genomic_DNA',
   'protein sequence ID': 'BAB75950.1'}},
 {'type': 'PIR', 'id': 'AD2337', 'properties': {'entry name': 'AD2337'}},
 {'type': 'PIR', 'id': 'I39601', 'properties': {'entry name': 'I39601'}},
 {'type': 'RefSeq',
  'id': 'WP_010998389.1',
  'properties': {'nucleotide sequence ID': 'NZ_RSCN01000010.1'}},
 {'type': 'PDB',
  'id': '4GYD',
  'properties': {'method': 'X-ray',
   'chains': 'A/B/C/D/E/F=26-111',
   'resolution': '1.80 A'}},
 {'type': 'PDB',
  'id': '4H0J',
  'properties': {'method': 'X-ray',
   'chains': 'A/B/C/D/E/F=26-111',
   'resolution': '2.00 A'}},
 {'type': 'PDB',
  'id': '4H0K',
  'properties': {'method': 'X-ray',
   'chains': 'A/B=26-111',
   'resolution': '1.95 A'}},
 {'type': 'PDBsum', 'id': '4GYD'},
 {'type': 'PDBsum', 'id': '4H0J'},
 {'type': 'PDBsum', 'id

In [40]:
# print the data we were looking for:
print("Data for accession %s (ID: %s)" % (j[0]['accession'], j[0]['id']))
print("List of Gene Ontologies:")
for item in j[0]['dbReferences']:
    if item['type'] == "GO":
        print("  id: %s, term: %s, source: %s" % (
                item['id'],
                item['properties']['term'],
                item['properties']['source']))


Data for accession P0A3X7 (ID: CYC6_NOSS1)
List of Gene Ontologies:
  id: GO:0031979, term: C:plasma membrane-derived thylakoid lumen, source: IEA:UniProtKB-SubCell
  id: GO:0009055, term: F:electron transfer activity, source: IEA:UniProtKB-UniRule
  id: GO:0020037, term: F:heme binding, source: IEA:InterPro
  id: GO:0005506, term: F:iron ion binding, source: IEA:InterPro
  id: GO:0015979, term: P:photosynthesis, source: IEA:UniProtKB-UniRule


## NCBI

In [41]:
headers = {'Accept': 'application/json'}
r = requests.get('https://api.ncbi.nlm.nih.gov/datasets/v1alpha/gene/id/%s' % 8291, headers=headers)
j = r.json()
j

{'genes': [{'query': ['8291'],
   'gene': {'gene_id': '8291',
    'symbol': 'DYSF',
    'description': 'dysferlin',
    'tax_id': '9606',
    'taxname': 'Homo sapiens',
    'type': 'PROTEIN_CODING',
    'orientation': 'plus',
    'genomic_ranges': [{'accession_version': 'NC_000002.12',
      'range': [{'begin': '71453154',
        'end': '71686763',
        'orientation': 'plus'}]}],
    'reference_standards': [{'gene_range': {'accession_version': 'NG_008694.1',
       'range': [{'begin': '4939', 'end': '238141', 'orientation': 'plus'}]},
      'type': 'REFSEQ_GENE'}],
    'transcripts': [{'accession_version': 'XR_001738969.1',
      'name': 'transcript variant X3',
      'length': 5686,
      'genomic_range': {'accession_version': 'NC_000002.12',
       'range': [{'begin': '71466685',
         'end': '71665275',
         'orientation': 'plus'}]},
      'exons': {'accession_version': 'NC_000002.12',
       'range': [{'begin': '71466685', 'end': '71466933', 'order': 1},
        {'begin'

In [42]:
gene = j['genes'][0]['gene']
gene['description']

'dysferlin'
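
Note that in this response the coordinates are returned as strings, so they must be converted before doing arithmetic. The sketch below works on a hand-made excerpt of the output above and computes the length of the genomic range, assuming that `begin` and `end` are both inclusive (an assumption, not something stated in the response).

```python
# Hand-made excerpt of the response, with values copied from the output above.
sample = {
    'genes': [{
        'query': ['8291'],
        'gene': {
            'gene_id': '8291',
            'symbol': 'DYSF',
            'description': 'dysferlin',
            'genomic_ranges': [{
                'accession_version': 'NC_000002.12',
                'range': [{'begin': '71453154', 'end': '71686763',
                           'orientation': 'plus'}],
            }],
        },
    }],
}

gene = sample['genes'][0]['gene']
first_range = gene['genomic_ranges'][0]['range'][0]

# 'begin' and 'end' are strings in the JSON, so convert them to integers.
begin = int(first_range['begin'])
end = int(first_range['end'])
span = end - begin + 1  # assumes inclusive coordinates

print('%s (%s): %d bp on %s' % (
    gene['symbol'], gene['description'], span,
    gene['genomic_ranges'][0]['accession_version']))
```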