# Databases

**Topics:**

* [Databases](#databases)

## Databases

### Table of Contents

1. [ChEMBL Database](#chembl-database)
2. [PubChem Database](#pubchem-database)
3. [PDB Database](#pdb-database)


### ChEMBL Database

<div style="text-align: center"><img src="images/ChEMBL_web_service_diagram.png" width="75%"></div>

**ChEMBL Database** is a manually curated database of bioactive molecules with drug-like properties. It contains information on approximately 2 million compounds, their bioactivities, and associated drug targets. The data in ChEMBL is derived from scientific literature and is often used in drug discovery and development processes.

**Data fetching and downloading from ChEMBL** can be efficiently done using the ChEMBL Web Resource Client, which allows easy access to ChEMBL data via the REST API. Researchers can query for molecules, targets, and bioactivities using various filters and retrieve data in a structured format for further analysis.

To learn more about the ChEMBL Web Resource Client, see its [GitHub repository](https://github.com/chembl/chembl_webresource_client) and the accompanying [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4489243/).

To get a sense of how useful ChEMBL Web Resource Client is, we will walk through a few examples:
* Retrieving molecules
* Retrieving protein target data
* Retrieving bioactivity data

In [None]:
from chembl_webresource_client.new_client import new_client

# available data entities
available_resources = [resource for resource in dir(new_client) if not resource.startswith('_')]
print(available_resources)

**Available filters**

The design of the client is based on the [Django QuerySet](https://docs.djangoproject.com/en/1.11/ref/models/querysets) interface, and the most important lookup types are supported. These are:

- exact
- iexact
- contains
- icontains
- in
- gt
- gte
- lt
- lte
- startswith
- istartswith
- endswith
- iendswith
- range
- isnull
- regex
- iregex
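
As a rough illustration of how these suffixes read, the sketch below mimics a handful of them on plain Python dictionaries. It is not how the client works internally (the real filtering is done by the ChEMBL web services); `matches`, `LOOKUPS`, and the sample records are hypothetical names and made-up values used only for this example.

```python
# Local, offline illustration of Django-style lookups. The real client
# sends the filters to the ChEMBL web services instead.
import re

# Hypothetical table: lookup suffix -> predicate(field_value, query_value)
LOOKUPS = {
    'exact': lambda value, query: value == query,
    'iexact': lambda value, query: str(value).lower() == str(query).lower(),
    'icontains': lambda value, query: str(query).lower() in str(value).lower(),
    'in': lambda value, query: value in query,
    'gte': lambda value, query: value >= query,
    'lte': lambda value, query: value <= query,
    'isnull': lambda value, query: (value is None) == query,
    'regex': lambda value, query: re.search(query, str(value)) is not None,
}


def matches(record, **filters):
    """Return True if a dict record satisfies every `field__lookup=value` filter."""
    for key, query in filters.items():
        parts = key.split('__')
        # A trailing known suffix is the lookup type; otherwise default to 'exact'.
        lookup = parts.pop() if parts[-1] in LOOKUPS and len(parts) > 1 else 'exact'
        value = record
        for part in parts:  # walk nested fields such as props__mw
            value = value.get(part) if isinstance(value, dict) else None
        if value is None and lookup != 'isnull':
            return False
        if not LOOKUPS[lookup](value, query):
            return False
    return True


# Made-up records, for illustration only
records = [
    {'name': 'ASPIRIN', 'props': {'mw': 180.16}},
    {'name': 'IMATINIB', 'props': {'mw': 493.62}},
]
light = [r['name'] for r in records if matches(r, props__mw__lte=300)]
nibs = [r['name'] for r in records if matches(r, name__iexact='imatinib')]
print(light, nibs)
```

The double underscore separates nested field names from the lookup type, which is the same convention the filters in the examples in this section follow.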


**Only operator**

`only` is a special method that limits the results to a selected set of fields. It takes a single argument: a list of fields to include in the result. The specified fields have to exist in the endpoint against which `only` is executed. Using `only` usually makes an API call faster, because returning less information saves bandwidth. The API logic also checks whether any SQL joins are needed to return the specified fields and skips unnecessary joins, which can improve performance considerably.

Please note that `only` has one limitation: nested fields in the list are ignored, i.e. calling `only(['molecule_properties__alogp'])` is equivalent to calling `only(['molecule_properties'])`.

For many-to-many relationships, `only` will not make any SQL join optimisation.
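
Because `only` ignores nested field names, one possible workaround is to request the parent field and pick out the nested value on the client side. The sketch below shows that step on an already-fetched record; `pluck` is a hypothetical helper, and the record is a hand-written stand-in for an API result rather than real output.

```python
def pluck(record, path, sep='__'):
    """Follow a double-underscore path through nested dicts; None if missing."""
    value = record
    for key in path.split(sep):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


# Hand-written stand-in for a record restricted to two top-level fields;
# the values are illustrative, not fetched from ChEMBL.
record = {
    'molecule_chembl_id': 'CHEMBL25',
    'molecule_properties': {'alogp': '1.31', 'mw_freebase': '180.16'},
}

alogp = pluck(record, 'molecule_properties__alogp')
print(alogp)
```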

#### Molecules

Molecule records may be retrieved in a number of ways. We will go through some examples.

In [None]:
# find a molecule by pref_name
molecule = new_client.molecule
mols = molecule.filter(pref_name__iexact='aspirin')
print(mols)

In [None]:
# use of only() to retrieve only specific fields
molecule = new_client.molecule
mols = molecule.filter(pref_name__iexact='aspirin').only('molecule_chembl_id', 'pref_name')
print(mols)

In [None]:
# get many molecules by their ChEMBL IDs
molecule = new_client.molecule
mols = molecule.filter(molecule_chembl_id__in=['CHEMBL25', 'CHEMBL112', 'CHEMBL521']).only('molecule_chembl_id', 'pref_name')
print(mols)

In [None]:
from IPython.display import SVG

# get image of a molecule
image = new_client.image
image.set_format('svg')
img = image.get('CHEMBL25')
SVG(img)

In [None]:
# get molecule structure
molecule = new_client.molecule
mols = molecule.filter(molecule_chembl_id='CHEMBL25').only('molecule_structures')
print(mols)

In [None]:
# find molecules similar to a given SMILES query with similarity threshold of 80%
similarity = new_client.similarity
mols = similarity.filter(smiles='CC(=O)Oc1ccccc1C(=O)O', similarity=80).only('molecule_chembl_id', 'pref_name')
for mol in mols:
    print(mol)

In [None]:
# find molecules similar to a given ChEMBL ID with similarity threshold of 70%
similarity = new_client.similarity
mols = similarity.filter(chembl_id='CHEMBL25', similarity=70).only('molecule_chembl_id', 'pref_name')
for mol in mols:
    print(mol)

In [None]:
# get all approved drugs
molecule = new_client.molecule
approved_drugs = molecule.filter(max_phase=4).only('molecule_chembl_id', 'pref_name').order_by('molecule_properties__mw_freebase')
approved_drugs

In [None]:
# get approved drugs for lung cancer
drug_indication = new_client.drug_indication
molecule = new_client.molecule

lung_cancer_ind = drug_indication.filter(efo_term__icontains='LUNG CARCINOMA')
chembl_ids = [x['molecule_chembl_id'] for x in lung_cancer_ind]
lung_cancer_drugs = molecule.filter(molecule_chembl_id__in=chembl_ids).only('molecule_chembl_id', 'pref_name')
print(len(lung_cancer_drugs))
lung_cancer_drugs

In [None]:
# filter drugs by first approval year (1980 or later) and USAN stem
drug = new_client.drug
res = drug.filter(first_approval__gte=1980).filter(usan_stem="-azosin").only('molecule_chembl_id', 'first_approval', 'usan_stem', 'usan_stem_definition')
res

In [None]:
# get all biotherapeutics
molecule = new_client.molecule
biotherapeutics = molecule.filter(biotherapeutic__isnull=False)
len(biotherapeutics)

In [None]:
# get molecules with molecular weight between 100 and 200
molecule = new_client.molecule
mols = molecule.filter(molecule_properties__mw_freebase__gte=100, molecule_properties__mw_freebase__lte=200)
print(len(mols))

In [None]:
# get molecules with molecular weight <= 300 whose preferred name ends with 'nib'
molecule = new_client.molecule
light_nib_molecules = molecule.filter(molecule_properties__mw_freebase__lte=300, pref_name__iendswith="nib").only(['molecule_chembl_id', 'pref_name'])
light_nib_molecules

In [None]:
# get molecules with no violations of Lipinski's rule of five
# Lipinski's rule of five states that a molecule is more likely to be orally bioavailable if it has:
# - no more than 5 hydrogen bond donors
# - no more than 10 hydrogen bond acceptors
# - a molecular weight of less than 500 Da
# - a calculated octanol-water partition coefficient (clogP) of less than 5
molecule = new_client.molecule
no_violations = molecule.filter(molecule_properties__num_ro5_violations=0)
len(no_violations)
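
To make the four criteria concrete, the sketch below counts rule-of-five violations locally from a property dictionary. It assumes the `molecule_properties` keys `hbd`, `hba`, `mw_freebase`, and `alogp`. The helper name `count_ro5_violations` and the sample values are invented for illustration, and ChEMBL's own `num_ro5_violations` may be computed with slightly different conventions.

```python
def count_ro5_violations(props):
    """Count Lipinski rule-of-five violations from a dict of properties."""
    checks = [
        float(props['hbd']) > 5,              # more than 5 hydrogen bond donors
        float(props['hba']) > 10,             # more than 10 hydrogen bond acceptors
        float(props['mw_freebase']) >= 500,   # molecular weight not below 500 Da
        float(props['alogp']) >= 5,           # calculated logP not below 5
    ]
    return sum(checks)


# Illustrative values only: one small, one large and lipophilic molecule.
# Strings are accepted because web APIs often return numbers as text.
small = {'hbd': '1', 'hba': '3', 'mw_freebase': '180.16', 'alogp': '1.31'}
large = {'hbd': '2', 'hba': '12', 'mw_freebase': '734.0', 'alogp': '6.2'}

print(count_ro5_violations(small))
print(count_ro5_violations(large))
```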

#### Targets

Examples of retrieving target data from ChEMBL:

In [None]:
from chembl_webresource_client.new_client import new_client

# get targets for a given gene name
target = new_client.target
gene_name = 'BRD4'
res = target.filter(target_synonym__icontains=gene_name).only(['organism', 'pref_name', 'target_chembl_id', 'target_type'])
for i in res:
    print(i)

In [None]:
# get target for a given ChEMBL ID
target = new_client.target
res = target.filter(target_chembl_id='CHEMBL217').only(['organism', 'pref_name', 'target_chembl_id', 'target_type'])
for i in res:
    print(i)

In [None]:
# get targets for a given uniprot ID
target = new_client.target
targets = target.filter(target_components__accession='P04629').only(['organism', 'pref_name', 'target_chembl_id', 'target_type'])
for i in targets:
    print(i)

#### Activities

Examples of retrieving activity data from ChEMBL:

In [None]:
from chembl_webresource_client.new_client import new_client

# get all activities for a given target
target = new_client.target
activity = new_client.activity
herg = target.filter(pref_name__iexact='hERG').only('target_chembl_id')[0]
herg_activities = activity.filter(target_chembl_id=herg['target_chembl_id']).filter(standard_type="IC50")

len(herg_activities)

In [None]:
# get all activities for a given target with assay type of B (binding)
activity = new_client.activity
activities = activity.filter(target_chembl_id='CHEMBL1824', assay_type='B')
len(activities)

In [None]:
# get all activities with a pChEMBL for a molecule
activity = new_client.activity
activities = activity.filter(molecule_chembl_id='CHEMBL25', pchembl_value__isnull=False)
len(activities)
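
The `pchembl_value` field is the negative base-10 logarithm of a molar potency measurement (such as IC50, Ki, or EC50), which puts different activity types on one comparable scale. The following sketch shows the conversion from a value given in nanomolar; `to_pchembl` is a hypothetical helper name, not part of the client.

```python
import math


def to_pchembl(value_nm):
    """Convert a potency in nM to a pChEMBL-style value.

    -log10(value in mol/L) is the same as 9 - log10(value in nM).
    """
    if value_nm <= 0:
        raise ValueError('potency must be positive')
    return 9 - math.log10(value_nm)


print(to_pchembl(100))    # potency of 100 nM
print(to_pchembl(1000))   # potency of 1 uM
```

Higher values therefore mean more potent compounds: each unit corresponds to a tenfold change in concentration.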




### PubChem Database

**PubChem Database** is an open chemistry database maintained by the NCBI, containing information on the biological activities of small molecules. It includes three main databases: Substance, Compound, and BioAssay, providing a comprehensive overview of the chemical and biological properties of molecules.

**The PUG REST (Power User Gateway Representational State Transfer) web service** is an API provided by PubChem that allows users to programmatically access the data in PubChem. It supports a variety of queries, including searching for chemical compounds, retrieving molecular information, and accessing biological assay data. 

To learn more about PUG REST, visit the [PUG REST website](https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest) and the [tutorial](https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST_Tutorial.html) provided by PubChem.
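
PUG REST requests are plain URLs made up of an input part, an operation, and an output format. Here is a minimal sketch, assuming the `compound/<namespace>/<identifier>/property/<properties>/<format>` layout described in the PUG REST documentation. `build_property_url` is a hypothetical helper, and the request itself is left commented out so that the example runs offline.

```python
from urllib.parse import quote

PUG_REST_BASE = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'


def build_property_url(identifier, namespace='cid',
                       properties=('MolecularFormula',), fmt='JSON'):
    """Assemble a PUG REST URL that asks for selected compound properties."""
    # quote() escapes characters that are not URL-safe (e.g. spaces in names)
    safe_identifier = quote(str(identifier), safe='')
    prop_list = ','.join(properties)
    return (f'{PUG_REST_BASE}/compound/{namespace}/{safe_identifier}'
            f'/property/{prop_list}/{fmt}')


url = build_property_url(5090, properties=('MolecularFormula', 'MolecularWeight'))
print(url)

# To actually fetch the data (requires network access):
# import json, urllib.request
# with urllib.request.urlopen(url) as response:
#     data = json.load(response)
```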

In the following examples we will use the [PubChemPy](https://pubchempy.readthedocs.io/en/latest/index.html) Python package, a simplified wrapper around PUG REST. It does not expose every feature of PUG REST, so for more complicated data pipelines you should refer to the [PUG REST website](https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest).

In [None]:
import pubchempy as pcp

# get compound by CID
compound = pcp.Compound.from_cid(5090)
compound.to_dict()

In [None]:
# access compound properties
compound = pcp.Compound.from_cid(5090)
print(compound.synonyms)
print(compound.isomeric_smiles)
print(compound.molecular_formula)
print(compound.molecular_weight)
print(compound.xlogp)

In [None]:
# search for compounds by an identifier
# first arg: identifier
# second arg: namespace or identifier_type, e.g. 'name', 'smiles', 'inchi', 'inchikey'
compounds = pcp.get_compounds('acetaminophen', 'name')
print(compounds)
print(compounds[0].isomeric_smiles)

In [None]:
# search for compounds by SMILES
compounds = pcp.get_compounds('CC(=O)Nc1ccc(O)cc1', 'smiles')
print(compounds)
print(compounds[0].synonyms)

In [None]:
# retrieve only specific properties
compounds = pcp.get_properties(['IsomericSMILES', 'MolecularFormula', 'MolecularWeight'], 'CC(=O)Nc1ccc(O)cc1', 'smiles')
print(compounds)

In [None]:
# search for compounds by similarity
compounds = pcp.get_compounds('CC(=O)Nc1ccc(O)cc1', 'smiles', searchtype='similarity', Threshold=90)
print(len(compounds))

### PDB Database

**The Protein Data Bank (PDB)** is a repository for the 3D structural data of large biological molecules. It is widely used in structural biology, molecular modeling, and bioinformatics to study the structure-function relationships of biomolecules.

**[Biopython](https://biopython.org/) package** is an open-source collection of tools for biological computation written in Python. It provides functionalities for working with sequences, structures, and other biological data, making it widely used in bioinformatics and computational biology.

`biopython` includes modules specifically designed for handling and analyzing structural data from PDB files. The `Bio.PDB` module allows users to parse PDB files, manipulate protein structures, and extract detailed information such as atoms, residues, chains, and secondary structure elements. This makes `biopython` an essential tool for researchers working with 3D biomolecular structures, enabling them to easily integrate structural data into their analyses and workflows.

<div style="text-align: center"><img src="images/biopython_structure.png" width="50%"/></div>

**[NGLView](https://nglviewer.org/nglview/latest/) package** is a Python package that provides an interactive widget for visualizing molecular structures directly in Jupyter notebooks. It is built on top of the NGL.js library and allows users to load, display, and interact with 3D structures from formats like PDB, MOL2, and others. With `nglview`, you can rotate, zoom, and explore molecular models, making it a powerful tool for visualizing biomolecular data in a user-friendly and interactive manner.

We will use `biopython` to explore the PDB and to manipulate protein structures retrieved directly from it. Additionally, we will use `nglview` to display protein structures within a Jupyter notebook environment.

In [None]:
from Bio.PDB import PDBList

# download a PDB file
pdbl = PDBList()
pdbl.retrieve_pdb_file('6JIM', pdir='./data', file_format='mmCif')

In [None]:
# download multiple PDB files
pdbl = PDBList()
pdbl.download_pdb_files(['6JIM', '1FAT'], pdir='./data', file_format='mmCif')

In [None]:
# download entire PDB
# this will take 2-4 days
pdbl = PDBList()
# pdbl.download_entire_pdb(file_format='mmCif')

In [None]:
# update the local copy of the PDB
pdbl = PDBList()
# pdbl.update_pdb(file_format='mmCif')

In [None]:
from Bio.PDB import MMCIFParser

# biopython Structure object
parser = MMCIFParser()
structure = parser.get_structure('6JIM', './data/6jim.cif')
structure

In [None]:
# iterate over the structure
for model in structure:
    for chain in model:
        print(f'Chain {chain.get_id()}')
        for residue in chain:
            print(residue.get_id(), residue.get_resname(), end=': ')
            for atom in residue:
                print(atom.get_name(), end=',')
            print()

In [None]:
# extract a specific atom from the structure
model = structure[0]
chain = model['A']
residue = chain[50]
atom = residue['CA']
# alternatively
atom = structure[0]['A'][50]['CA']
# print some properties of the atom
print(atom.element)
print(atom.get_name())
print(atom.get_coord())
print(atom.get_bfactor())
print(atom.get_occupancy())

In [None]:
# alternative way to navigate the structure
# iterate over the models
print('Models:')
for model in structure.get_models():
    print(model.get_id(), end=',')
print()
# iterate over the chains
print('Chains:')
for chain in structure.get_chains():
    print(chain.get_id(), end=',')
print()
# iterate over the residues
print('Residues:')
for residue in structure.get_residues():
    print(residue.get_resname(), end=',')
print()
# iterate over the atoms
print('Atoms:')
for atom in structure.get_atoms():
    print(atom.get_name(), end=',')

In [None]:
# modifying the structure
chain_A = structure[0]['A']
# find all water residues
res_ids_to_remove = []
for residue in chain_A.get_residues():
    if residue.get_id()[0] == 'W':
        res_ids_to_remove.append(residue.get_id())

# remove the water residues
for res_id in res_ids_to_remove[::-1]:
    chain_A.detach_child(res_id)

# print the modified structure to verify the removal
for residue in structure[0]['A'].get_residues():
    print(residue.get_id(), residue.get_resname())
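
The test `residue.get_id()[0] == 'W'` relies on how `Bio.PDB` identifies residues: each id is a tuple of hetero flag, sequence number, and insertion code. The hetero flag is blank for standard residues, `'W'` for water, and `'H_'` followed by the residue name for other hetero residues. The sketch below classifies a few hand-written id tuples without needing a structure file; `classify_residue_id` is a hypothetical helper.

```python
def classify_residue_id(res_id):
    """Label a Bio.PDB-style residue id tuple by its hetero flag."""
    hetero_flag, _seq_number, _insertion_code = res_id
    if hetero_flag == 'W':
        return 'water'
    if hetero_flag.startswith('H_'):
        return 'hetero'
    return 'standard'


# Hand-written ids in the (hetero flag, sequence number, insertion code) layout
res_ids = [(' ', 50, ' '), ('W', 301, ' '), ('H_GLC', 401, ' ')]
labels = [classify_residue_id(res_id) for res_id in res_ids]
print(labels)
```

The same idea extends to stripping ligands: collecting ids whose flag starts with `'H_'` would select the non-water hetero residues instead.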

In [None]:
# let's make the above code into a function
def remove_water(chain_id, structure):
    chain = structure[0][chain_id]
    res_ids_to_remove = []
    for residue in chain.get_residues():
        if residue.get_id()[0] == 'W':
            res_ids_to_remove.append(residue.get_id())
    for res_id in res_ids_to_remove[::-1]:
        chain.detach_child(res_id)

# run the function for all chains in the structure
for chain in structure.get_chains():
    remove_water(chain.get_id(), structure)

In [None]:
from Bio.PDB import PDBIO

# save the modified structure
io = PDBIO()
io.set_structure(structure)
io.save('./data/6jim_modified.pdb')

In [None]:
import nglview as nv

# view structure in nglview
view = nv.show_biopython(structure)
view