# Structural Biology for Lead Discovery - Practical Class 1: REST interfaces in ChEMBL and PubChem

## Synopsis

Getting programmatic access to databases allows us to go from muggle to wizard: we have better access to the data and can integrate it seamlessly into our pipelines.

In [None]:
#!pip install rdkit

In [None]:
import json  # lets us work with the json format
import requests  # allows Python to make web requests
import pandas as pd  # analysis of tabular data
import rdkit.Chem as chem  # cheminformatics toolkit

## ChEMBL

### The base URL 

A REST GET query consists of a URL that specifies the data we want to obtain, sent with the HTTP `GET` method, which tells the server what kind of operation we want to perform.

In this case, we will be using:

    base_url = "https://www.ebi.ac.uk/chembl/api/data/{:s}" 

In [None]:
base_url = "https://www.ebi.ac.uk/chembl/api/data/{:s}" 
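As a quick offline illustration of how the placeholder works, the sketch below fills the template and splits the resulting URL into its parts with the standard-library `urllib.parse`. The `resource` value is only an example, and nothing is sent to the server here.

```python
from urllib.parse import urlsplit

base_url = "https://www.ebi.ac.uk/chembl/api/data/{:s}"

# Fill the {:s} placeholder with the resource we want to address.
resource = "molecule/CHEMBL25"
full_url = base_url.format(resource)

# Split the URL into its components to see what each part encodes.
parts = urlsplit(full_url)
print(parts.scheme)  # protocol
print(parts.netloc)  # server that hosts the API
print(parts.path)    # which resource we are asking for
```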

### Getting a molecule by its ID

The question, then, is how we encode our queries into a URL. The answer is by learning how the API works. Let's illustrate this by simply getting a molecule through the REST API. In this case, we will be using

    base_url/molecule/MOLECULE_ID

**Note that when I write in uppercase, it marks the part that you need to replace**.

In [None]:
chembl_id = "CHEMBL25"
molecule_url = base_url.format(f"molecule/{chembl_id}")
molecule_url

Now, we do the query. In Python, we use the requests module.

In [None]:
response = requests.get(molecule_url)

If the request went all right, we should get a 200 status code:

In [None]:
assert(response.status_code==200)
response.status_code
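An `assert` stops the notebook without saying much about what went wrong. As a hedged sketch, the standard-library `http.HTTPStatus` enum can turn a numeric code into a readable message. `describe_status` is a hypothetical helper name, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return f"{code}: not a standard HTTP status code"
    outcome = "success" if 200 <= code < 300 else "not a success code"
    return f"{code} {status.phrase} ({outcome})"


for code in (200, 404, 500):
    print(describe_status(code))
```

With `requests`, calling `response.raise_for_status()` is the usual way to turn a 4xx or 5xx code into an exception.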

Nice, so now, let's see what we got:

In [None]:
# Print the content of the response
print("Response content:")
print(response.content)

### Issues

- We got an XML (I think that all sensible people should hate XML files)
- Finding information here is very complicated. 

But this is not a final boss problem. We can use the `headers` to request a prettier format, such as JSON.

In [None]:
# Send a GET request to the API and retrieve the JSON response
response = requests.get(molecule_url, headers={"Accept": "application/json"})

# Check if the response is successful
assert(response.status_code == 200)
# Convert the JSON body of the response into a Python dictionary
molecule_request = response.json()
molecule_request

This is nicer to explore: it is just a dictionary. Now, let's make it even prettier by turning it into a table.

In [None]:
pd.DataFrame.from_records([response.json()]).T
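Because the molecule record is a nested dictionary, individual values can be pulled out by key. The sketch below uses a hand-written, heavily trimmed stand-in for the CHEMBL25 record so that it runs offline. The key names (`pref_name`, `molecule_structures`, `molecule_properties`) are assumed to match the real response and should be checked against `molecule_request` in your own session. `get_nested` is a hypothetical helper.

```python
# Trimmed, hand-written stand-in for the JSON returned for CHEMBL25.
# The real record has many more fields; check the keys in your own response.
molecule_record = {
    "molecule_chembl_id": "CHEMBL25",
    "pref_name": "ASPIRIN",
    "molecule_structures": {"canonical_smiles": "CC(=O)Oc1ccccc1C(=O)O"},
    "molecule_properties": {"full_mwt": "180.16"},
}


def get_nested(record, *keys, default=None):
    """Walk down nested dictionaries, returning `default` if a key is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


smiles = get_nested(molecule_record, "molecule_structures", "canonical_smiles")
weight = float(get_nested(molecule_record, "molecule_properties", "full_mwt"))
missing = get_nested(molecule_record, "molecule_properties", "not_a_key")
print(smiles, weight, missing)
```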

### Understanding the ChEMBL API

How did I know I had to use `/molecule/CHEMBL25`? By going to your best friend when you are dealing with APIs: the [documentation](https://chembl.gitbook.io/chembl-interface-documentation/web-services/chembl-data-web-services).

Note: This other document is also interesting, https://www.ebi.ac.uk/chembl/api/data/docs

### More advanced queries

Some of the information that we just found could also have been computed with RDKit. Now, let's go deeper: let's retrieve biochemical assay results for a given cancer target ("CHEMBL206").

In [None]:
# ChEMBL ID of the target

chembl_id = "CHEMBL206"
activity_url = base_url.format(f"activity?target_chembl_id__exact={chembl_id}")
response = requests.get(activity_url, headers={"Accept": "application/json"})
assert(response.status_code == 200)
activity_request = response.json()
activity_table = pd.DataFrame.from_dict(activity_request['activities'])[['molecule_chembl_id', 'type', 'standard_value', 'standard_units']]
activity_table

We get a table with these values. Now, let's try to sort them by activity.

In [None]:
# Order by value
activity_table.sort_values(['standard_value'], ascending=[True])

**Question**: What is wrong??
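Separately from the sorting question, note that ChEMBL returns list results one page at a time, so `activity_table` holds only the first page. Each JSON response carries a `page_meta` block whose `next` field points to the following page, or is `null` on the last one. The sketch below follows those links. `fetch_all` and `fake_fetch` are hypothetical names, and the fake pages are made-up stand-ins for real responses so that the example runs offline.

```python
from urllib.parse import urljoin

API_ROOT = "https://www.ebi.ac.uk"


def fetch_all(first_url, fetch_json, key, max_pages=50):
    """Collect records from every page by following page_meta['next'].

    fetch_json is any callable that takes a URL and returns the decoded JSON.
    """
    records, url, pages = [], first_url, 0
    while url and pages < max_pages:
        page = fetch_json(url)
        records.extend(page[key])
        next_path = page["page_meta"]["next"]  # relative path, or None at the end
        url = urljoin(API_ROOT, next_path) if next_path else None
        pages += 1
    return records


# Offline stand-in for the server: two tiny pages of made-up activities.
FIRST_URL = "https://www.ebi.ac.uk/chembl/api/data/activity?x=1"
FAKE_PAGES = {
    FIRST_URL: {
        "activities": [{"id": 1}, {"id": 2}],
        "page_meta": {"next": "/chembl/api/data/activity?x=1&offset=2"},
    },
    FIRST_URL + "&offset=2": {
        "activities": [{"id": 3}],
        "page_meta": {"next": None},
    },
}


def fake_fetch(url):
    """Return the made-up page stored for this URL."""
    return FAKE_PAGES[url]


all_records = fetch_all(FIRST_URL, fake_fetch, "activities")
print(len(all_records))
```

Against the real API, `fetch_json` could be something like `lambda url: requests.get(url, headers={"Accept": "application/json"}).json()`. The `max_pages` guard is a design choice to avoid downloading a very large result set by accident.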

### Data queries in Pandas

Pandas is not simply a pretty-looking table reader and writer. It also lets us run queries on the data:


- Sort DataFrame by a specific column in ascending order
    

    ```df.sort_values(by='column_name', ascending=True, inplace=True)```

- Sort DataFrame by multiple columns, with a separate order for each (here `col1` descending, then `col2` ascending)

    ```df.sort_values(by=['col1', 'col2'], ascending=[False, True], inplace=True)```

- Filter rows where a column's value meets a specific condition

    ```filtered_df = df[df['column_name'] > 100]```

- Filter rows based on multiple conditions using logical operators (AND: & , OR: |)
    
    ```filtered_df = df[(df['column1'] > 50) & (df['column2'] < 100)]```

- Keep specific columns in the DataFrame

    ```selected_columns_df = df[['column1', 'column2']]```

- Drop specific columns from the DataFrame

    ```filtered_df = df.drop(columns=['column_to_drop1', 'column_to_drop2'])```

- Filter rows with NaN values in a specific column

    ```filtered_df = df[df['column_name'].isnull()]```

- Filter rows without NaN values in a specific column
    
    ```filtered_df = df[df['column_name'].notnull()]```
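The patterns listed above can be tried on a small table. The sketch below uses made-up compound names and values (they are not ChEMBL data) and assigns each result to a new variable instead of using `inplace=True`.

```python
import pandas as pd

# Made-up data: one missing IC50 value to illustrate NaN handling.
df = pd.DataFrame(
    {
        "compound": ["A", "B", "C", "D"],
        "ic50_nM": [250.0, 12.0, None, 75.0],
        "mol_weight": [310.4, 455.2, 289.1, 512.9],
    }
)

# Sort by one column; rows with NaN are placed last by default.
by_potency = df.sort_values(by="ic50_nM", ascending=True)

# Combine two conditions with & (each condition needs its own parentheses).
potent = df[(df["ic50_nM"] < 100) & (df["mol_weight"] < 500)]

# Rows with and without a missing value in a given column.
missing = df[df["ic50_nM"].isnull()]
measured = df[df["ic50_nM"].notnull()]

# Keep only some columns.
slim = df[["compound", "ic50_nM"]]

print(by_potency)
print(potent)
```

Note that `sort_values(..., inplace=True)` modifies `df` directly and returns `None`, so its result should not be assigned to a variable.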

### Drug mechanisms

ChEMBL also has information that we can query online to understand the mechanism of each drug.

In [None]:
mechanism_url = base_url.format(f"mechanism?target_chembl_id__exact={chembl_id}&limit=10")
response = requests.get(mechanism_url, headers={"Accept": "application/json"})
assert(response.status_code == 200)
mechanism_request = response.json()
mechanism_request

Let's get a table

In [None]:
mechanism_table = pd.DataFrame.from_dict(mechanism_request['mechanisms'])[['molecule_chembl_id', 'target_chembl_id', 'max_phase']]
mechanism_table
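The activity and mechanism queries above write their filters into the URL by hand. As a sketch, the query string can also be assembled from keyword arguments with the standard-library `urlencode`, which takes care of separators and escaping. `chembl_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

base_url = "https://www.ebi.ac.uk/chembl/api/data/{:s}"


def chembl_query_url(resource, **filters):
    """Build a ChEMBL URL for `resource`, appending filters as a query string."""
    url = base_url.format(resource)
    if filters:
        url += "?" + urlencode(filters)
    return url


mechanism_url = chembl_query_url(
    "mechanism", target_chembl_id__exact="CHEMBL206", limit=10
)
print(mechanism_url)
```

An alternative with the same effect is to pass the filters to `requests.get` through its `params` argument.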

### Identifiers outside ChEMBL

UniProt and ChEMBL use different identifiers. Let's start by finding the ChEMBL identifier that corresponds to a UniProt accession.

In [None]:
target_protein_url = base_url.format("target_component?accession=P03372")
target_components = requests.get(target_protein_url, headers={"Accept":"application/json"}).json()['target_components']
target_components

In [None]:
targets_list = ';'.join([i['target_chembl_id'] for i in target_components[0]['targets']])
targets_url = base_url.format("target/set/{:s}".format(targets_list))
targets = requests.get(targets_url, headers={"Accept":"application/json"}).json()

In [None]:
pd.DataFrame([(item['target_chembl_id'], item['target_type']) for item in targets['targets']])
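The cells above only read the first entry of `target_components`. A hedged sketch of collecting the target IDs from every component, without duplicates, is shown below. `sample_components` is made-up data that only mimics the nesting used above (a `targets` list whose items have a `target_chembl_id`), and `collect_target_ids` is a hypothetical helper.

```python
def collect_target_ids(components):
    """Return the unique target ChEMBL IDs found in all components, in order."""
    seen = []
    for component in components:
        for target in component.get("targets", []):
            target_id = target["target_chembl_id"]
            if target_id not in seen:
                seen.append(target_id)
    return seen


# Made-up data with the same nesting as the target_component response.
sample_components = [
    {
        "targets": [
            {"target_chembl_id": "CHEMBL206"},
            {"target_chembl_id": "CHEMBL_PLACEHOLDER"},  # not a real ID
        ]
    },
    {"targets": [{"target_chembl_id": "CHEMBL206"}]},
]

target_ids = collect_target_ids(sample_components)
targets_list = ";".join(target_ids)  # format expected by target/set/ID1;ID2
print(targets_list)
```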

## PubChem

PubChem provides several ways for programmatic access to its data:

- PUG-REST: a simplified access route to PubChem without the overhead of XML or SOAP envelopes. PUG-REST provides convenient access to information on PubChem records not possible with other PUG services.

- PUG-View: REST-style web service that provides full reports, including third-party textual annotation, for individual PubChem records.

- Power User Gateway (PUG): PUG provides programmatic access to PubChem services via a single common gateway interface (CGI), available at http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi.

- PUG-SOAP: web service access to PubChem data using the Simple Object Access Protocol (SOAP).

- PubChemRDF REST interface: a REST-style interface designed to access RDF-encoded PubChem data.

- Entrez Utilities: E-utils are a set of programs used to access information contained in the Entrez system.

### PUG-REST

How PUG-REST works: most, if not all, of the information the service needs to produce its results is encoded in the URL.

The conceptual framework of this service is the three-part request:

- input: which records we are talking about, given as an identifier (CID, SID, AID) or as a name, SMILES, InChI...

- operation: what to do with those records

- output: the format in which the information should be returned (XML, SDF, PNG, TXT...)

**Design of the URL:**
PUG-REST is entirely based on HTTP (or HTTPS) requests. Every URL starts with the prefix "https://pubchem.ncbi.nlm.nih.gov/rest/pug".

**Input:**

    /compound/name/xxx
    /compound/cid/xxx

    cid | name | smiles | inchi | sdf | inchikey | formula | ...

**Operation:**

    /property/InChI
    /property/MolecularWeight
    /cids

    <compound property> = property / [comma-separated list of property tags]

If no operation is specified at all, the default is to retrieve the entire record. Which operations are available depends on the input domain. Property tags include MolecularWeight, MolecularFormula, CanonicalSMILES, XLogP...; the full list is in the compound property table of the documentation (https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest).

**Output:**

    /TXT
    /JSON
    /PNG

    <output specification> = XML | ASNT | ASNB | JSON | JSONP | SDF | CSV | PNG | TXT
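The three-part design can be sketched as a small URL builder. `pug_rest_url` is a hypothetical helper, not part of any PubChem library. It only joins the input, operation, and output parts and percent-encodes the identifier.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pug_rest_url(namespace, identifier, operation=None, output="JSON", domain="compound"):
    """Assemble input / operation / output into a PUG-REST URL."""
    parts = [PUG_REST, domain, namespace, quote(str(identifier), safe="")]
    if operation:  # e.g. "cids" or "property/MolecularWeight"
        parts.append(operation)
    parts.append(output)
    return "/".join(parts)


cids_url = pug_rest_url("name", "ibuprofen", "cids")
props_url = pug_rest_url("cid", 3672, "property/MolecularWeight,MolecularFormula")
image_url = pug_rest_url("cid", 3672, output="PNG")  # no operation: whole record
print(cids_url)
print(props_url)
print(image_url)
```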

In [None]:
# Get PubChem CID by name

name = "ibuprofen"
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/cids/JSON"

r = requests.get(url)
response = r.json()
if "IdentifierList" in response:
    cid = response["IdentifierList"]["CID"][0]
else:
    raise ValueError(f"Could not find matches for compound: {name}") #  used to raise exceptions or errors
print(f"PubChem CID for {name} is:\n{cid}")


Let's get some molecular properties

In [None]:
# Get molecular weight for ibuprofen
cid ="3672"
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/property/MolecularWeight/JSON"

r = requests.get(url)
response = r.json()

if "PropertyTable" in response:
    mol_weight = response["PropertyTable"]["Properties"][0]["MolecularWeight"]
else:
    raise ValueError(f"Could not find matches for PubChem CID: {cid}")
print(f"Molecular weight for {name} is:\n{mol_weight}")
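Several properties can be requested at once by separating the tags with commas. The sketch below parses such a response offline. `sample_response` is a hand-written stand-in shaped like the `PropertyTable` used above, `extract_properties` is a hypothetical helper, and the `Fault` key in the error branch is an assumption about how PubChem reports failures that you should verify against a real failed request.

```python
def extract_properties(response, tags):
    """Return {tag: value} for the first record of a PUG-REST property response."""
    if "PropertyTable" not in response:
        # Assumed error layout: {"Fault": {"Message": "..."}}
        message = response.get("Fault", {}).get("Message", "unknown error")
        raise ValueError(f"PubChem request failed: {message}")
    record = response["PropertyTable"]["Properties"][0]
    return {tag: record.get(tag) for tag in tags}


# Hand-written stand-in for
# .../compound/cid/3672/property/MolecularFormula,MolecularWeight/JSON
sample_response = {
    "PropertyTable": {
        "Properties": [
            {"CID": 3672, "MolecularFormula": "C13H18O2", "MolecularWeight": "206.28"}
        ]
    }
}

props = extract_properties(sample_response, ["MolecularFormula", "MolecularWeight"])
mol_weight = float(props["MolecularWeight"])  # convert in case the value is a string
print(props["MolecularFormula"], mol_weight)
```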


### Similarity search in PubChem

PubChem allows for two kinds of similarity search:

- **fastsimilarity_2d**: 2-D similarity, evaluated with the PubChem substructure fingerprint and the Tanimoto index.

- **fastsimilarity_3d**: 3-D similarity, evaluated with the shape-Tanimoto (ST) and color-Tanimoto (CT) scores, which quantify the similarity between conformers in 3-D shape and in functional group orientation, respectively. The ST and CT scores are calculated using the Gaussian-shape overlay method by Grant and Pickup, as implemented in the Rapid Overlay of Chemical Structures (ROCS). 3-D similarity search uses a shape-Tanimoto of >=0.80 and a color-Tanimoto of >=0.50 as the similarity threshold. Because it takes much longer than 2-D similarity search, it often exceeds the 30-second time limit and returns a time-out error, especially when the query molecule is big.
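For intuition, the Tanimoto index of two fingerprints is the number of bits set in both, divided by the number of bits set in either. A minimal sketch on toy fingerprints, represented here as sets of "on" bit positions (PubChem's real fingerprints are much longer bit strings):

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto index of two fingerprints given as sets of 'on' bit positions."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


# Toy fingerprints: 4 shared bits, 6 bits set in total.
query = {1, 4, 7, 9, 12}
candidate = {1, 4, 7, 10, 12}

score = tanimoto(query, candidate)
print(round(score, 2))
```

In a PUG-REST similarity query the `Threshold` argument is given as a percentage, so `Threshold=99` keeps only near-identical structures.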


In [None]:
smiles = "C1COCC(=O)N1C2=CC=C(C=C2)N3C[C@@H](OC3=O)CNC(=O)C4=CC=C(S4)Cl"
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/smiles/{smiles}/cids/txt?Threshold=99"

response = requests.get(url)
print(response.text)
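SMILES strings can contain characters such as `#`, `/`, `@` and `[` that are reserved in URLs. A cautious sketch is to percent-encode the SMILES before placing it in the path, and to turn the plain-text answer into a list of integers. `parse_cid_text` is a hypothetical helper, and `sample_text` is made-up output used only so that the example runs offline.

```python
from urllib.parse import quote, unquote

smiles = "C1COCC(=O)N1C2=CC=C(C=C2)N3C[C@@H](OC3=O)CNC(=O)C4=CC=C(S4)Cl"

# Percent-encode every reserved character so the SMILES stays one path segment.
encoded = quote(smiles, safe="")
url = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/"
    f"fastsimilarity_2d/smiles/{encoded}/cids/TXT?Threshold=99"
)


def parse_cid_text(text):
    """Convert the one-CID-per-line TXT output into a list of integers."""
    return [int(token) for token in text.split() if token.isdigit()]


sample_text = "101\n202\n303\n"  # made-up CIDs, not real search results
print(url)
print(parse_cid_text(sample_text))
```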
