# Biomedical Data Bases, 2021-2022
### Create Your Own Database
These are notes by Prof. Davide Salomoni (d.salomoni@unibo.it) for the Biomedical Data Bases course at the University of Bologna, academic year 2021-2022.

## Install the redis module

Remember that you should have already started the Redis container _with persistence_.

In [None]:
! pip install redis

In [None]:
import redis
r = redis.Redis(host="my_redis")

Is Redis running? Check with the `ping()` method. If Redis is running, it returns `True`; if Redis is _not_ running, `ping()` raises an exception. Run the following cell both when the Redis container is running and when it is not.

In [None]:
try:
    if r.ping():
        print('Redis is running')
except redis.ConnectionError:
    print('Redis is NOT running')

## Verify how to map a Python dictionary to a Redis hash

In [None]:
# create a test Python dictionary
my_dict = {'one': 1, 'two': 2, 'three': 3, 'four': 4}
print(my_dict)

In [None]:
# create the hash "numbers" in redis
r.hset('numbers', mapping=my_dict)

# get the hash back from redis as a python dictionary
new_dict = r.hgetall('numbers')
print(new_dict)
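
With a default `redis.Redis(...)` client, `hgetall()` returns both field names and values as `bytes`, so the integers stored above come back as byte strings. A minimal sketch of a decoding helper follows. `decode_hash` is a hypothetical name, and the sample dictionary is hand-written to mimic the reply of `r.hgetall('numbers')`, so the cell runs without a Redis connection.

```python
def decode_hash(raw):
    """Convert a {bytes: bytes} mapping, as returned by hgetall(), to {str: str}."""
    return {k.decode(): v.decode() for k, v in raw.items()}

# hand-written sample mimicking the reply of r.hgetall('numbers')
sample = {b'one': b'1', b'two': b'2'}
decoded = decode_hash(sample)
print(decoded)
```

Alternatively, redis-py accepts `decode_responses=True` when the client is created, in which case replies are returned as `str` directly.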

In [None]:
# find all keys in the DB matching the expression '*umb*'
my_keys = r.keys('*umb*')
print(my_keys)
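
`keys()` inspects the whole keyspace in a single blocking call. That is fine for a small teaching database, but the Redis documentation warns against using it on large production instances. A possible alternative is the cursor-based `SCAN` command, which redis-py exposes as `scan_iter()`. The sketch below assumes a client object offering that method; `find_keys` is a hypothetical helper name.

```python
def find_keys(client, pattern):
    """Return a sorted list of the keys matching a glob-style pattern.

    `client` is assumed to be a redis.Redis instance (or any object
    offering a compatible scan_iter(match=...) method).
    """
    # scan_iter walks the keyspace incrementally instead of in one blocking call
    return sorted(client.scan_iter(match=pattern))

# usage with the client created earlier in this notebook:
# print(find_keys(r, '*umb*'))
```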

In [None]:
# delete the key 'numbers' from Redis
r.delete('numbers')

# confirm it is now deleted
print(r.hgetall('numbers'))

## Query PDB, Uniprot and store the results in Redis

Refer to the slides for details about the data model.

In [None]:
import requests
pdb_query = '''
{
  entries(entry_ids: ["4GYD", "1TU2"]) {
    entry {
      id
    }
    rcsb_entry_info {
      molecular_weight
      deposited_atom_count
      deposited_modeled_polymer_monomer_count
    }
    polymer_entities {
      rcsb_entity_source_organism {
        ncbi_scientific_name
      }
      uniprots {
        rcsb_uniprot_container_identifiers {
          uniprot_id
        }
        rcsb_uniprot_protein {
          name {
            value
          }
        }
      }
    }
  }
}
'''
# get the PDB data with GraphQL
p = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(pdb_query))
j = p.json()
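
A GraphQL server can report problems in an `errors` list inside the JSON body, possibly alongside partial `data`, so it can be worth checking the payload before using it. The following is a minimal sketch under that assumption. `get_entries` is a hypothetical helper, and the sample payloads are hand-written for illustration, not real RCSB responses.

```python
def get_entries(payload):
    """Return the list of entries from a decoded GraphQL response.

    Raises ValueError if the response carries an 'errors' list or
    lacks the expected data -> entries structure.
    """
    if payload.get('errors'):
        messages = [e.get('message', 'unknown error') for e in payload['errors']]
        raise ValueError('GraphQL errors: ' + '; '.join(messages))
    data = payload.get('data') or {}
    entries = data.get('entries')
    if entries is None:
        raise ValueError("no 'entries' found in the response")
    return entries

# hand-written payloads for illustration only
ok = {'data': {'entries': [{'entry': {'id': '4GYD'}}]}}
print([e['entry']['id'] for e in get_entries(ok)])

bad = {'errors': [{'message': 'Validation error'}]}
try:
    get_entries(bad)
except ValueError as exc:
    print(exc)
```

In this notebook the helper would be called as `get_entries(j)`. Calling `p.raise_for_status()` before `p.json()` would additionally catch failures at the HTTP level.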

In [None]:
# which keys are there?
j.keys()

In [None]:
# explore what the returned data looks like:
# it is a set of nested Python data structures;
# we will need to extract the values we need
j['data']

In [None]:
# for example, extract some macromolecule parameters
for prot in (j['data']['entries']):
    # each entry corresponds to a single PDB ID
    print("ID : ", prot['entry']['id'])
    print("Macromolecule parameters:")
    print("  molecular weight (kDa): ", prot['rcsb_entry_info']['molecular_weight'])

In [None]:
# extract data and update the Redis database
# let's start with a clean database (WARNING: THIS WILL DELETE ALL EXISTING ENTRIES)
r.flushall()
# the print() statements below are for explanatory purposes
for protein in j['data']['entries']:
    # parameters at the individual PDB entry level
    pdb_id = protein['entry']['id']
    print("PDB:", pdb_id)
    weight = protein['rcsb_entry_info']['molecular_weight']
    atom_count = protein['rcsb_entry_info']['deposited_atom_count']
    residue_count = protein['rcsb_entry_info']['deposited_modeled_polymer_monomer_count']
    # store an entry (a hash) with the parameters above in Redis
    # the key will be the PDB ID
    pdb_dict = {'weight': weight, 'atom_count': atom_count, 'residue_count': residue_count}
    r.hset(pdb_id, mapping=pdb_dict)
    # update the PDB index
    r.sadd('PDB:index', pdb_id)
    for polymer in protein['polymer_entities']:
        # parameters for the polymers
        source_name = polymer['rcsb_entity_source_organism'][0]['ncbi_scientific_name']
        for uprot in polymer['uniprots'] or []:  # 'uniprots' may be null if there is no Uniprot mapping
            # uniprot-related data
            uprot_id = uprot['rcsb_uniprot_container_identifiers']['uniprot_id']
            uprot_name = uprot['rcsb_uniprot_protein']['name']['value']
            print("Uniprot:", uprot_id, source_name, uprot_name)
            # store an entry (a hash) with the source_name and uprot_name in Redis
            # the key will be PDB_ID:UNIPROT_ID
            key = '%s:%s' % (pdb_id, uprot_id)
            r.hset(key, 'organism', source_name)
            r.hset(key, 'name', uprot_name)
            # update the Uniprot index
            r.sadd('UNIPROT:index', uprot_id)
            # call the uniprot REST API looking up uprot_id
            uniprot_url = 'https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession=%s' % uprot_id
            u = requests.get(uniprot_url, headers={"Accept" : "application/json"})
            # the Gene Ontology information is stored in the 'dbReferences' structure (see slides)
            db_info = u.json()[0]['dbReferences']
            for db in db_info:
                if db['type'] == 'GO':
                    # it is a Gene Ontology entry
                    go_id = db['id']
                    go_term = db['properties']['term']
                    go_source = db['properties']['source']
                    print(go_id, go_term, go_source)
                    # store an entry (a hash) with GO info in Redis
                    # the key will be PDB_ID:UNIPROT_ID:GO_ID
                    key = '%s:%s:%s' % (pdb_id, uprot_id, go_id)
                    go_dict = {'go_term': go_term, 'go_source': go_source}
                    r.hset(key, mapping=go_dict)
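
The innermost loop above selects the Gene Ontology cross-references from the `dbReferences` list. The same filtering can be isolated in a small function, which makes it easy to try on sample data without calling the API. This is only a sketch: `go_entries` is a hypothetical name, and the sample list is hand-written in the shape the loop above expects (the IDs, term and source strings are placeholders, not real annotations).

```python
def go_entries(db_references):
    """Yield (go_id, go_term, go_source) for each GO cross-reference."""
    for db in db_references:
        if db.get('type') != 'GO':
            continue  # skip cross-references to other databases
        props = db.get('properties', {})
        yield db['id'], props.get('term'), props.get('source')

# hand-written sample in the same shape as u.json()[0]['dbReferences']
sample_refs = [
    {'type': 'GO', 'id': 'GO:0000000',
     'properties': {'term': 'placeholder term', 'source': 'placeholder source'}},
    {'type': 'EMBL', 'id': 'X00000'},
]
for go_id, go_term, go_source in go_entries(sample_refs):
    # same key layout as above: PDB_ID:UNIPROT_ID:GO_ID
    key = '%s:%s:%s' % ('1TU2', 'Q93SW9', go_id)
    print(key, go_term, go_source)
```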


### Performing queries on the Redis database

We have chosen the key names so that queries are easy to perform (see the slides). The code above created the following keys:
- keys called _PDB_ID_, mapped to a hash containing weight, atom count and residue count (using _hset_)
- keys called _PDB_ID:UNIPROT_ID_, mapped to a hash containing the organism's scientific name and the protein name (using _hset_)
- keys called _PDB_ID:UNIPROT_ID:GO_ID_, mapped to a hash containing GO term and GO source (using _hset_)
- a single key called PDB:index, mapped to a set with all the PDB IDs (using _sadd_)
- a single key called UNIPROT:index, mapped to a set with all the Uniprot IDs (using _sadd_)
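
GO identifiers themselves contain a colon (they look like `GO:` followed by digits), so a key of the third kind has four colon-separated fields rather than three. A minimal sketch of how the entry keys could be split back into their parts follows. `split_key` is a hypothetical helper, the GO number in the example is a placeholder, and the two index keys are not meant to be passed to it.

```python
def split_key(key):
    """Split an entry key of this data model into (pdb_id, uniprot_id, go_id).

    Missing parts are returned as None. Keys are accepted as str or bytes.
    """
    if isinstance(key, bytes):
        key = key.decode()
    # split at most twice, so the colon inside the GO ID is preserved
    parts = key.split(':', 2)
    parts += [None] * (3 - len(parts))
    return tuple(parts)

print(split_key('4GYD'))
print(split_key('1TU2:Q93SW9'))
print(split_key(b'1TU2:Q93SW9:GO:0000000'))
```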

In [None]:
# all characteristics of a given PDB ID:
r.hgetall('4GYD')

In [None]:
# all PDB IDs stored in the database:
k = r.smembers('PDB:index')
print(k)

In [None]:
# all Uniprot IDs stored in the database:
k = r.smembers('UNIPROT:index')
print(k)

In [None]:
# all GO entries for a certain Uniprot ID:
k = r.keys('*:Q93SW9:GO:*')
print(k)

In [None]:
# all information about a certain Uniprot ID
# and all information about its GO entries
print(r.hgetall('1TU2:Q93SW9'))
for k in r.keys('1TU2:Q93SW9:*'):
    print(k, r.hgetall(k))

In [None]:
# after a restart of the Redis database, verify that we still have the entries
# note that by default the redis client returns keys and values as "bytes",
# so we decode them to strings before processing them
r = redis.Redis(host="my_redis")
for pdb in r.smembers('PDB:index'):
    values = {k.decode():v.decode() for k,v in r.hgetall(pdb).items()}
    print("PDB ID:", pdb.decode())
    print("  molecular weight (kDa):", values['weight'])
    print("  atom count:", values['atom_count'])
    print("  residue count:", values['residue_count'])
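
Redis hashes store field values as strings, so the numbers written earlier come back as text even after decoding. A sketch of converting them back to numeric types follows. `typed_pdb_values` is a hypothetical helper, the field names are those used in this notebook, and the sample values are made up.

```python
def typed_pdb_values(raw):
    """Decode a PDB hash returned by hgetall() and convert its fields to numbers."""
    values = {k.decode(): v.decode() for k, v in raw.items()}
    return {
        'weight': float(values['weight']),
        'atom_count': int(values['atom_count']),
        'residue_count': int(values['residue_count']),
    }

# made-up sample mimicking r.hgetall(pdb) for one entry
sample = {b'weight': b'12.5', b'atom_count': b'1000', b'residue_count': b'120'}
print(typed_pdb_values(sample))
```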