# Biomedical Data Bases, 2020-2021
### Programmatic Access to Databases
These are the notes by Prof. Davide Salomoni (d.salomoni@unibo.it) for the Biomedical Data Bases course at the University of Bologna, academic year 2020-2021.

## Running an external script from Python

Here we demonstrate the _subprocess_ module, which runs an external program from Python and captures its exit status and output.

In [1]:
# An example of running an external program (here: 'ls -l') from Python
import subprocess

sp = subprocess.run('ls -l', shell=True, capture_output=True, text=True)
print(sp.stdout)

total 67780
-rw-rw-r-- 1 jovyan users   785065 Sep  3  2015 AirQualityUCI.csv
-rw-r--r-- 1 jovyan users      472 Jan 18 17:47 aminoacids.csv
-rw-r--r-- 1 jovyan users     8192 Jan 18 18:07 aminoacids.sqlite
-rw-r--r-- 1 jovyan users     2109 Jan 19 07:18 batch_download.sh
drwxr-xr-x 9 jovyan users      288 Jan 19 14:01 bdb-2021
-rw-r--r-- 1 jovyan users    17675 Jan 11 12:18 bdb_notes.ipynb
-rw-r--r-- 1 jovyan users      750 Dec 26 11:32 check_docker_ps.py
-rw-r--r-- 1 jovyan users   139490 Jan 18 13:22 Consumer_demo.ipynb
-rw-r--r-- 1 jovyan users    26559 Dec 12 13:29 COVID-19-italy-only.xlsx
-rw-r--r-- 1 jovyan users  4189941 Dec  8 19:05 COVID-19-sample-BDB2021.csv
-rw-r--r-- 1 jovyan users  3499284 Dec  8 19:06 COVID-19-sample-BDB2021.xlsx
drwxr-xr-x 7 jovyan users      224 Jan 17 16:18 data
-rw-r--r-- 1 jovyan users    63775 Jan 11 08:21 excel-analysis-demo.ipynb
-rw-r--r-- 1 jovyan users   150162 Jan  8 16:15 excel-analysis.ipynb
-rw-r--r-- 1 jovyan users     2580 Jan 18 13:21 G
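The call above passes the whole command as one string to the shell. When parts of the command come from variables, it is often safer to pass the program and its arguments as a list, so that no shell quoting is involved. The following is a minimal sketch of that style; the command (the Python interpreter itself, located through `sys.executable`) is chosen only so that the example runs on any platform.

```python
import subprocess
import sys

# Pass the program and its arguments as a list: no shell is involved,
# so spaces or special characters in the arguments need no quoting.
cmd = [sys.executable, '-c', 'print("hello from a child process")']
sp = subprocess.run(cmd, capture_output=True, text=True)

# returncode is 0 when the external program succeeded
print('return code:', sp.returncode)
print('stdout:', sp.stdout.strip())
```

The same pattern applies to any program, for example `['ls', '-l']` on a Unix-like system.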

In [2]:
# let's clone the BDB GitHub repo locally
import subprocess

print('Fetching the BDB repo from GitHub...')
git_repo = 'https://github.com/dsalomoni/bdb-2021.git'
sp = subprocess.run('git clone %s' % git_repo, shell=True, capture_output=True, text=True)
if sp.returncode:
    # cloning failed (typically because the bdb-2021 directory already exists): try to pull instead
    sp = subprocess.run('cd bdb-2021 && git pull %s' % git_repo, shell=True, capture_output=True, text=True)
    if sp.returncode:
        # another error, give up
        print("Error fetching the repo:")
        print(sp.stderr)
    else:
        print("Pulled latest changes:")
        print(sp.stdout, sp.stderr)
else:
    print("Cloning finished:")
    print(sp.stdout, sp.stderr)

Fetching the BDB repo from GitHub...
Pulled latest changes:
Already up to date.
 From https://github.com/dsalomoni/bdb-2021
 * branch            HEAD       -> FETCH_HEAD
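Testing `returncode` by hand works, but `subprocess.run` can also raise an exception on failure when called with `check=True`. The sketch below wraps this in a helper; the name `run_checked` is hypothetical, and the two commands are placeholders that only exercise the success and failure paths.

```python
import subprocess
import sys

def run_checked(cmd):
    """Run cmd (a list of arguments); return its stdout, or None on failure."""
    try:
        sp = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError as err:
        # check=True raises this exception when the exit status is non-zero
        print('Command failed with exit status', err.returncode)
        print(err.stderr)
        return None
    return sp.stdout

# One command that succeeds and one that exits with status 3
ok = run_checked([sys.executable, '-c', 'print("fine")'])
failed = run_checked([sys.executable, '-c', 'import sys; sys.exit(3)'])
print(repr(ok), repr(failed))
```

With such a helper, a "clone, else pull" sequence becomes two calls where the second runs only if the first returned `None`.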



## Using the _requests_ module

In [3]:
# test the requests module by querying Google
import requests
res = requests.get('http://www.google.com')
print(res)

<Response [200]>


In [4]:
print(res.status_code)

200


In [5]:
# if the requests call succeeded, print the text that was returned.
res = requests.get('http://www.google.com')
if res.status_code == 200:
    print(res.text)

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="it"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="TIeM4gNr9S/xlAnHBAMPLQ==">(function(){window.google={kEI:'_v0HYOT8KbTVgwf9ipjIBQ',kEXPI:'0,775921,583488,730,224,5104,207,1699,1505,10,1226,364,1499,817,383,246,5,1129,225,222,4192,3,66,767,217,1264,1464,2395,7,3740,433,1112996,1232,1196520,139,391,8,328977,51223,16115,28684,9188,8384,4859,1361,9291,3024,4743,12841,4020,978,13226,2056,920,873,10622,7432,7096,4517,2778,920,2275,8,85,2711,1593,1279,2212,530,149,1103,840,520,1463,56,4258,312,1137,2,2669,2023,1777,520,1947,2229,93,328,1284,2943,2247,3599,3227,1990,855,7,4773,7581,5096,600,7276,4929,108,3407,908,2,941,2614,2397,7468,3277,3,346,230,970,865,3,4622,148,5990,7985,4,1528,2304,1236,1145,4658,537,43,1211,1373,1056,17,447,459,1202,353,4067,5634,1426,549

### Remember to check the status code

In [6]:
r = requests.get('https://github.com/timelines.json')
print(r.status_code)

410


In [7]:
print(r.text)

{"message":"Hello there, wayfaring stranger. If you’re reading this then you probably didn’t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.","documentation_url":"https://docs.github.com/v3/activity/events/#list-public-events-performed-by-a-user"}
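A status code such as 410 is easier to act on once it is turned into words. The sketch below uses only the standard library (`http.HTTPStatus`) and works on the integer found in `r.status_code`; the helper name `describe_status` and the wording of the categories are illustrative assumptions.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    status = HTTPStatus(code)   # raises ValueError for codes it does not know
    if 200 <= code < 300:
        outcome = 'success'
    elif 400 <= code < 500:
        outcome = 'client error'
    elif code >= 500:
        outcome = 'server error'
    else:
        outcome = 'informational or redirect'
    return '%d %s (%s)' % (code, status.phrase, outcome)

print(describe_status(200))
print(describe_status(410))
```

With _requests_, calling `r.raise_for_status()` is an alternative: it raises an exception for 4xx and 5xx responses instead of returning silently.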


### Query PDB with REST calls

In [8]:
my_protein = '4GYD'
pdb_url = 'https://files.rcsb.org/download/%s.pdb' % my_protein
fasta_url = 'https://www.rcsb.org/fasta/entry/%s' % my_protein
# get the pdb and fasta representations
pdb = requests.get(pdb_url)
fasta = requests.get(fasta_url)

In [9]:
# print the text returned in the fasta variable
fasta.text

'>4GYD_1|Chains A,B,C,D,E,F|Cytochrome c6|Nostoc (103690)\nADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNAMPAFKGRLKPEQIEDVAAYVLGKADADWK\n'

In [10]:
# print the text returned in the pdb variable
pdb.text

'HEADER    ELECTRON TRANSPORT                      05-SEP-12   4GYD              \nTITLE     NOSTOC SP CYTOCHROME C6                                               \nCOMPND    MOL_ID: 1;                                                            \nCOMPND   2 MOLECULE: CYTOCHROME C6;                                             \nCOMPND   3 CHAIN: A, B, C, D, E, F;                                             \nCOMPND   4 SYNONYM: CYTOCHROME C-553, CYTOCHROME C553, SOLUBLE CYTOCHROME F;    \nCOMPND   5 ENGINEERED: YES                                                      \nSOURCE    MOL_ID: 1;                                                            \nSOURCE   2 ORGANISM_SCIENTIFIC: NOSTOC;                                         \nSOURCE   3 ORGANISM_TAXID: 103690;                                              \nSOURCE   4 STRAIN: PCC 7120;                                                    \nSOURCE   5 GENE: PETJ;                                                          \nSOURCE   6 EXPR

In [11]:
# pdb.text is a rather long string:
len(pdb.text)

445662
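A string this long is easier to handle line by line. In the legacy PDB text format, the first six columns of each line hold the record name (`HEADER`, `TITLE`, `COMPND`, `ATOM`, ...), so a quick overview is to count the lines per record name. The sketch below does this on a few header lines copied from the output above; the helper name `count_records` is hypothetical.

```python
from collections import Counter

def count_records(pdb_text):
    """Count the lines of a PDB file by record name (first six columns)."""
    return Counter(
        line[:6].strip() for line in pdb_text.splitlines() if line.strip()
    )

# A few header lines copied from the output above, used as sample input
sample = (
    'HEADER    ELECTRON TRANSPORT                      05-SEP-12   4GYD\n'
    'TITLE     NOSTOC SP CYTOCHROME C6\n'
    'COMPND    MOL_ID: 1;\n'
    'COMPND   2 MOLECULE: CYTOCHROME C6;\n'
)
counts = count_records(sample)
for name, n in counts.items():
    print(name, n)
```

Applied to the full download, `count_records(pdb.text)['ATOM']` would give the number of atom records in the entry.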

In [12]:
# print info from multiple PDB entries
proteins = ['4GYD', '4H0J', '4H0K']
fasta = dict()
for p in proteins:
    r = requests.get('https://www.rcsb.org/fasta/entry/%s' % p)
    fasta[p] = r.text

for f in fasta:
    print(f, ':', fasta[f])

4GYD : >4GYD_1|Chains A,B,C,D,E,F|Cytochrome c6|Nostoc (103690)
ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNAMPAFKGRLKPEQIEDVAAYVLGKADADWK

4H0J : >4H0J_1|Chains A,B,C,D,E,F|Cytochrome c6|Nostoc (103690)
ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNACPAFKGRLKPEQIEDVAAYVLGKADADWK

4H0K : >4H0K_1|Chains A,B|Cytochrome c6|Nostoc (103690)
ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNAHPAFKGRLKPEQIEDVAAYVLGKADADWK
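The three sequences above look almost identical. To see where they differ, the FASTA text can be split into header and sequence and then compared position by position. The following is a minimal sketch that works offline on text copied from the output above; the helper names `parse_fasta` and `differences` are hypothetical, and the comparison assumes the two sequences have the same length (no alignment is performed).

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # a new record starts: store the previous one, if any
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records

def differences(seq_a, seq_b):
    """Return (position, residue_a, residue_b) per mismatch, 1-based."""
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
            if a != b]

# FASTA text for 4GYD and 4H0J, copied from the output above
fasta_4gyd = ('>4GYD_1|Chains A,B,C,D,E,F|Cytochrome c6|Nostoc (103690)\n'
              'ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNAMPAFKGRLKPEQIEDVAAYVLGKADADWK\n')
fasta_4h0j = ('>4H0J_1|Chains A,B,C,D,E,F|Cytochrome c6|Nostoc (103690)\n'
              'ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNACPAFKGRLKPEQIEDVAAYVLGKADADWK\n')

header_a, seq_a = parse_fasta(fasta_4gyd)[0]
header_b, seq_b = parse_fasta(fasta_4h0j)[0]
print(header_a.split('|')[0], 'vs', header_b.split('|')[0])
print(differences(seq_a, seq_b))
```

Each tuple reports the 1-based position followed by the residue in the first and in the second sequence.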



### Requesting JSON output

Having the results as JSON makes extracting info much easier than having to parse regular text.

In [13]:
# query the PDB using the REST API. It returns JSON output.
r = requests.get('https://data.rcsb.org/rest/v1/core/entry/4GYD')
data = r.json()
type(data)

dict

In [14]:
# 'data' is a dictionary: check what its keys are:
data.keys()

dict_keys(['audit_author', 'cell', 'citation', 'diffrn', 'diffrn_detector', 'diffrn_radiation', 'diffrn_source', 'entry', 'exptl', 'exptl_crystal', 'exptl_crystal_grow', 'pdbx_audit_revision_category', 'pdbx_audit_revision_details', 'pdbx_audit_revision_group', 'pdbx_audit_revision_history', 'pdbx_audit_revision_item', 'pdbx_database_related', 'pdbx_database_status', 'pdbx_vrpt_summary', 'rcsb_accession_info', 'rcsb_entry_container_identifiers', 'rcsb_entry_info', 'rcsb_primary_citation', 'refine', 'refine_hist', 'refine_ls_restr', 'reflns', 'reflns_shell', 'software', 'struct', 'struct_keywords', 'symmetry', 'rcsb_id'])

In [15]:
# get info from the 'cell' key:
data['cell']

{'angle_alpha': 90.0,
 'angle_beta': 90.0,
 'angle_gamma': 90.0,
 'length_a': 77.721,
 'length_b': 79.803,
 'length_c': 80.154,
 'zpdb': 24}
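Once the cell parameters are available as numbers, derived quantities can be computed directly. As an illustration, the sketch below computes the unit cell volume with the general formula for a triclinic cell, V = a·b·c·sqrt(1 − cos²α − cos²β − cos²γ + 2·cosα·cosβ·cosγ), which reduces to a·b·c when all angles are 90 degrees. The helper name `cell_volume` is hypothetical, and the result is in cubic angstroms under the assumption that the lengths are given in angstroms.

```python
import math

def cell_volume(cell):
    """Volume of a unit cell from its three lengths and three angles (degrees)."""
    a, b, c = cell['length_a'], cell['length_b'], cell['length_c']
    ca = math.cos(math.radians(cell['angle_alpha']))
    cb = math.cos(math.radians(cell['angle_beta']))
    cg = math.cos(math.radians(cell['angle_gamma']))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)

# The dictionary shown in the output above
cell = {'angle_alpha': 90.0, 'angle_beta': 90.0, 'angle_gamma': 90.0,
        'length_a': 77.721, 'length_b': 79.803, 'length_c': 80.154,
        'zpdb': 24}
volume = cell_volume(cell)
print('Unit cell volume: %.1f' % volume)
```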

In [16]:
# get info for polymer entity data, providing PDB ID and polymer ID:
r = requests.get('https://data.rcsb.org/rest/v1/core/polymer_entity/4GYD/1')
data = r.json()
data.keys()

dict_keys(['rcsb_cluster_membership', 'entity_poly', 'entity_src_gen', 'rcsb_entity_host_organism', 'rcsb_entity_source_organism', 'rcsb_polymer_entity', 'rcsb_polymer_entity_align', 'rcsb_polymer_entity_annotation', 'rcsb_polymer_entity_container_identifiers', 'rcsb_polymer_entity_feature_summary', 'rcsb_polymer_entity_name_com', 'rcsb_id', 'rcsb_cluster_flexibility', 'rcsb_latest_revision'])

In [17]:
# see what's inside the 'entity_poly' key:
data['entity_poly']

{'nstd_linkage': 'no',
 'nstd_monomer': 'no',
 'pdbx_seq_one_letter_code': 'ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNAMPAFKGRLKPEQIEDVAAYVLGKADADWK',
 'pdbx_seq_one_letter_code_can': 'ADSVNGAKIFSANCASCHAGGKNLVQAQKTLKKADLEKYGMYSAEAIIAQVTNGKNAMPAFKGRLKPEQIEDVAAYVLGKADADWK',
 'pdbx_strand_id': 'A,B,C,D,E,F',
 'rcsb_artifact_monomer_count': 0,
 'rcsb_conflict_count': 0,
 'rcsb_deletion_count': 0,
 'rcsb_entity_polymer_type': 'Protein',
 'rcsb_insertion_count': 0,
 'rcsb_mutation_count': 0,
 'rcsb_non_std_monomer_count': 0,
 'rcsb_sample_sequence_length': 86,
 'type': 'polypeptide(L)'}

In [18]:
# get the PubMed annotations for this entry
r2 = requests.get('https://data.rcsb.org/rest/v1/core/pubmed/4GYD')
data2 = r2.json()
data2.keys()

dict_keys(['rcsb_id', 'rcsb_pubmed_container_identifiers', 'rcsb_pubmed_doi', 'rcsb_pubmed_abstract_text', 'rcsb_pubmed_affiliation_info', 'rcsb_pubmed_mesh_descriptors', 'rcsb_pubmed_mesh_descriptors_lineage'])

In [19]:
# this is the pubmed abstract:
data2['rcsb_pubmed_abstract_text']

'The rapid transfer of electrons in the photosynthetic redox chain is achieved by the formation of short-lived complexes of cytochrome b6f with the electron transfer proteins plastocyanin and cytochrome c6. A balance must exist between fast intermolecular electron transfer and rapid dissociation, which requires the formation of a complex that has limited specificity. The interaction of the soluble fragment of cytochrome f and cytochrome c6 from the cyanobacterium Nostoc sp. PCC 7119 was studied using NMR spectroscopy and X-ray diffraction. The crystal structures of wild type, M58H and M58C cytochrome c6 were determined. The M58C variant is an excellent low potential mimic of the wild type protein and was used in chemical shift perturbation and paramagnetic relaxation NMR experiments to characterize the complex with cytochrome f. The interaction is highly dynamic and can be described as a pure encounter complex, with no dominant stereospecific complex. Ensemble docking calculations and 

## The PDB Search API

In [20]:
# a BLAST-like example using the PDB search API
fasta = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS"
my_query = '''{
    "query": {
        "type" : "terminal",
        "service" : "sequence",
        "parameters" : {
            "evalue_cutoff" : 1,
            "identity_cutoff" : 0.9,
            "target" : "pdb_protein_sequence",
            "value" : "%s"
        }
    },
    "request_options" : {
        "scoring_strategy" : "sequence"
    },
    "return_type" : "polymer_entity"
}''' % fasta
r = requests.get('http://search.rcsb.org/rcsbsearch/v1/query?json=%s' % requests.utils.requote_uri(my_query))
j = r.json()

In [21]:
# these are the keys of the dictionary:
print(j.keys())

dict_keys(['query_id', 'result_type', 'total_count', 'explain_meta_data', 'result_set'])


In [22]:
# let's print the results:
print("We got %s matches" % j['total_count'])
print("The first %s results follow:" % len(j['result_set']))
for item in j['result_set']:
    print(item['identifier'], "score =", item['score'])

We got 407 matches
The first 10 results follow:
4Q21_1 score = 1.0
5X9S_1 score = 0.6575342465753424
6Q21_1 score = 0.4383561643835616
1IOZ_1 score = 0.4383561643835616
1AA9_1 score = 0.4383561643835616
1Q21_1 score = 0.4383561643835616
2Q21_1 score = 0.3835616438356164
6AMB_1 score = 0.3561643835616438
6KYH_2 score = 0.3287671232876712
1LFD_2 score = 0.3150684931506849
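Writing the query as a string and filling it with `%` works, but it breaks easily if a value contains a quote or a brace. An alternative is to build the query as a Python dictionary and let the `json` module serialize it. The sketch below does this with the same fields as the query above and then URL-encodes it with the standard library; the helper name `build_sequence_query` is hypothetical, and the sequence is shortened for illustration only.

```python
import json
from urllib.parse import urlencode, parse_qs

def build_sequence_query(sequence, identity_cutoff=0.9, evalue_cutoff=1):
    """Return the sequence search query as a Python dictionary."""
    return {
        "query": {
            "type": "terminal",
            "service": "sequence",
            "parameters": {
                "evalue_cutoff": evalue_cutoff,
                "identity_cutoff": identity_cutoff,
                "target": "pdb_protein_sequence",
                "value": sequence,
            },
        },
        "request_options": {"scoring_strategy": "sequence"},
        "return_type": "polymer_entity",
    }

# Shortened sequence, for illustration only
query = build_sequence_query("MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSY")

# json.dumps produces valid JSON; urlencode escapes it for use in a URL
query_string = urlencode({"json": json.dumps(query)})
url = 'http://search.rcsb.org/rcsbsearch/v1/query?' + query_string
print(url[:100] + '...')
```

With _requests_, the encoding step can be left to the library by passing `params={"json": json.dumps(query)}` to `requests.get`.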


In [23]:
# a sequence motif search
# here we use the zinc finger Cys2His2-like fold group;
# its PROSITE signature is available at https://prosite.expasy.org/PS00028
my_query = '''
{
  "query": {
    "type": "terminal",
    "service": "seqmotif",
    "parameters": {
      "value": "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H",
      "pattern_type": "prosite",
      "target": "pdb_protein_sequence"
    }
  },
  "return_type": "polymer_entity"
}
'''
r = requests.get('http://search.rcsb.org/rcsbsearch/v1/query?json=%s' % requests.utils.requote_uri(my_query))
j = r.json()

In [24]:
j.keys()

dict_keys(['query_id', 'result_type', 'total_count', 'explain_meta_data', 'result_set'])

In [25]:
print("There are %s results in total, we got back details for the first %s" % 
      (j['total_count'], len(j['result_set'])))

There are 486 results in total, we got back details for the first 10


In [26]:
print('This is the detailed info for the first result:')
print(j['result_set'][0])
print('\nThe identifiers for the returned results are:')
for item in j['result_set']:
    print(item['identifier'])

This is the detailed info for the first result:
{'identifier': '1YUJ_3', 'score': 1.0, 'services': [{'service_type': 'seqmotif', 'nodes': [{'node_id': 28100, 'original_score': 1.0, 'norm_score': 1.0, 'match_context': [{'start': 27, 'end': 48}]}]}]}

The identifiers for the returned results are:
1YUJ_3
2DLK_1
5US3_1
1UN6_1
5K5H_1
7CUY_1
1VA1_1
5YEG_1
6AHD_26
5YEG_2


## GraphQL

In [27]:
# a GraphQL query
my_query = '''
{
    entry(entry_id: "4GYD") {
        cell {
            Z_PDB
            angle_alpha
            angle_beta
            angle_gamma
            formula_units_Z
            length_a
            length_b
            length_c
            pdbx_unique_axis
            volume
        }
    }
}
'''

r = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(my_query))
j = r.json()

In [28]:
# check the keys of the dictionary:
j.keys()

dict_keys(['data'])

In [29]:
# explore what is in j['data']:
j['data']

{'entry': {'cell': {'Z_PDB': 24,
   'angle_alpha': 90.0,
   'angle_beta': 90.0,
   'angle_gamma': 90.0,
   'formula_units_Z': None,
   'length_a': 77.721,
   'length_b': 79.803,
   'length_c': 80.154,
   'pdbx_unique_axis': None,
   'volume': None}}}

In [30]:
# print results with some formatting:
params = j['data']['entry']['cell']
for key,value in params.items():
    print(key, ':', value)

Z_PDB : 24
angle_alpha : 90.0
angle_beta : 90.0
angle_gamma : 90.0
formula_units_Z : None
length_a : 77.721
length_b : 79.803
length_c : 80.154
pdbx_unique_axis : None
volume : None
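GraphQL replies mirror the shape of the query, so they are often nested several levels deep. When the nesting is not known in advance, a small recursive helper can flatten the reply into dotted key paths. This is a minimal sketch; the name `flatten` is hypothetical and the sample is a subset of the reply shown above.

```python
def flatten(d, prefix=''):
    """Flatten nested dictionaries into one dictionary with dotted keys."""
    flat = {}
    for key, value in d.items():
        path = prefix + '.' + key if prefix else key
        if isinstance(value, dict):
            # descend into the nested dictionary, extending the path
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

# A subset of the reply shown above
sample = {'entry': {'cell': {'Z_PDB': 24,
                             'angle_alpha': 90.0,
                             'length_a': 77.721,
                             'volume': None}}}
flat = flatten(sample)
for path, value in flat.items():
    print(path, ':', value)
```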


## UniProt

In [31]:
# get data from UniProt using the Proteins API
# note that what we get back is a list, not a dictionary
requestURL = "https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession=P0A3X7&reviewed=true"

r = requests.get(requestURL, headers={"Accept" : "application/json"})
j = r.json()
type(j)

list

In [32]:
# the returned list holds the entries we asked for
# note that there is only one entry:
len(j)

1

In [33]:
# this one entry is actually a dictionary:
j[0]

{'accession': 'P0A3X7',
 'id': 'CYC6_NOSS1',
 'proteinExistence': 'Evidence at protein level',
 'info': {'type': 'Swiss-Prot',
  'created': '2005-03-15',
  'modified': '2020-12-02',
  'version': 99},
 'organism': {'taxonomy': 103690,
  'names': [{'type': 'scientific',
    'value': 'Nostoc sp. (strain PCC 7120 / SAG 25.82 / UTEX 2576)'}],
  'lineage': ['Bacteria',
   'Cyanobacteria',
   'Nostocales',
   'Nostocaceae',
   'Nostoc']},
 'secondaryAccession': ['P28596'],
 'protein': {'recommendedName': {'fullName': {'value': 'Cytochrome c6'}},
  'alternativeName': [{'fullName': {'value': 'Cytochrome c-553'}},
   {'fullName': {'value': 'Cytochrome c553'}},
   {'fullName': {'value': 'Soluble cytochrome f'}}]},
 'gene': [{'name': {'value': 'petJ'},
   'synonyms': [{'value': 'cytA'}],
   'olnNames': [{'value': 'alr4251'}]}],
 'comments': [{'type': 'FUNCTION',
   'text': [{'value': 'Functions as an electron carrier between membrane-bound cytochrome b6-f and photosystem I in oxygenic photosynthes

In [34]:
# check the keys of the dictionary j[0]
j[0].keys()

dict_keys(['accession', 'id', 'proteinExistence', 'info', 'organism', 'secondaryAccession', 'protein', 'gene', 'comments', 'features', 'dbReferences', 'keywords', 'references', 'sequence'])

In [35]:
# these are the entries under the key 'dbReferences'
j[0]['dbReferences']

[{'type': 'EMBL',
  'id': 'M97009',
  'properties': {'molecule type': 'Genomic_DNA',
   'protein sequence ID': 'AAA59365.1'}},
 {'type': 'EMBL',
  'id': 'BA000019',
  'properties': {'molecule type': 'Genomic_DNA',
   'protein sequence ID': 'BAB75950.1'}},
 {'type': 'PIR', 'id': 'AD2337', 'properties': {'entry name': 'AD2337'}},
 {'type': 'PIR', 'id': 'I39601', 'properties': {'entry name': 'I39601'}},
 {'type': 'RefSeq',
  'id': 'WP_010998389.1',
  'properties': {'nucleotide sequence ID': 'NZ_RSCN01000010.1'}},
 {'type': 'PDB',
  'id': '4GYD',
  'properties': {'method': 'X-ray',
   'chains': 'A/B/C/D/E/F=26-111',
   'resolution': '1.80 A'}},
 {'type': 'PDB',
  'id': '4H0J',
  'properties': {'method': 'X-ray',
   'chains': 'A/B/C/D/E/F=26-111',
   'resolution': '2.00 A'}},
 {'type': 'PDB',
  'id': '4H0K',
  'properties': {'method': 'X-ray',
   'chains': 'A/B=26-111',
   'resolution': '1.95 A'}},
 {'type': 'PDBsum', 'id': '4GYD'},
 {'type': 'PDBsum', 'id': '4H0J'},
 {'type': 'PDBsum', 'id

In [36]:
# print the data we were looking for:
print("Data for accession %s (ID: %s)" % (j[0]['accession'], j[0]['id']))
print("List of Gene Ontologies:")
for item in j[0]['dbReferences']:
    if item['type'] == "GO":
        print("  id: %s, term: %s, source: %s" % (
                item['id'],
                item['properties']['term'],
                item['properties']['source']))


Data for accession P0A3X7 (ID: CYC6_NOSS1)
List of Gene Ontologies:
  id: GO:0031977, term: C:thylakoid lumen, source: IEA:UniProtKB-SubCell
  id: GO:0009055, term: F:electron transfer activity, source: IEA:UniProtKB-UniRule
  id: GO:0020037, term: F:heme binding, source: IEA:InterPro
  id: GO:0005506, term: F:iron ion binding, source: IEA:InterPro
  id: GO:0015979, term: P:photosynthesis, source: IEA:UniProtKB-UniRule


## NCBI

In [37]:
# query the NCBI Datasets API for gene ID 8291, asking for a JSON reply
headers = {'Accept': 'application/json'}
r = requests.get('https://api.ncbi.nlm.nih.gov/datasets/v1alpha/gene/id/%s' % 8291, headers=headers)
j = r.json()
j

{'genes': [{'gene': {'gene_id': '8291',
    'symbol': 'DYSF',
    'description': 'dysferlin',
    'tax_id': '9606',
    'taxname': 'Homo sapiens',
    'type': 'PROTEIN_CODING',
    'orientation': 'plus',
    'genomic_ranges': [{'accession_version': 'NC_000002.12',
      'range': [{'begin': '71453154',
        'end': '71686763',
        'orientation': 'plus'}]}],
    'reference_standards': [{'gene_range': {'accession_version': 'NG_008694.1',
       'range': [{'begin': '4939', 'end': '238141', 'orientation': 'plus'}]},
      'type': 'REFSEQ_GENE'}],
    'transcripts': [{'accession_version': 'XR_001738969.1',
      'name': 'transcript variant X3',
      'length': 5686,
      'genomic_range': {'accession_version': 'NC_000002.12',
       'range': [{'begin': '71466685',
         'end': '71665275',
         'orientation': 'plus'}]},
      'exons': {'accession_version': 'NC_000002.12',
       'range': [{'begin': '71466685', 'end': '71466933', 'order': 1},
        {'begin': '71480883', 'end': '

In [38]:
# take the first gene record in the reply and show its description
gene = j['genes'][0]['gene']
gene['description']

'dysferlin'
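Note that in the reply above the coordinates (`begin`, `end`) are strings, not numbers, so they must be converted before doing arithmetic. The sketch below extracts the first genomic range and computes its length; the helper name `gene_span` is hypothetical, and the `+ 1` assumes that both coordinates are inclusive.

```python
def gene_span(gene):
    """Return (accession, begin, end, length) for the first genomic range."""
    genomic = gene['genomic_ranges'][0]
    rng = genomic['range'][0]
    # the API returns the coordinates as strings: convert them to integers
    begin, end = int(rng['begin']), int(rng['end'])
    # assumption: begin and end are both inclusive
    return genomic['accession_version'], begin, end, end - begin + 1

# A subset of the record shown above
sample_gene = {
    'gene_id': '8291',
    'symbol': 'DYSF',
    'genomic_ranges': [{'accession_version': 'NC_000002.12',
                        'range': [{'begin': '71453154',
                                   'end': '71686763',
                                   'orientation': 'plus'}]}],
}
accession, begin, end, length = gene_span(sample_gene)
print('%s spans %s:%d-%d (%d bp)' % (sample_gene['symbol'], accession, begin, end, length))
```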