# PUG_REST PubChem Search

## Introduction

This notebook shows how the PUG REST API can be used to retrieve attributes of PubChem compounds.\
REST (**Re**presentational **S**tate **T**ransfer) is a stateless architectural style for web services that standardizes data transfer over the World Wide Web. A REST API lets us retrieve data with simple web URLs.
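
To make the idea of "simple web URLs" concrete, here is a minimal sketch of how a by-name lookup URL can be assembled. The helper `build_name_url` and the constant `PUG_REST_BASE` are hypothetical names introduced for illustration; the URL pattern is the one used later in this notebook. Percent-encoding the name is an added precaution for names that contain spaces.

```python
from urllib.parse import quote

# Base address of the PUG REST service, as used throughout this notebook
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def build_name_url(name, output="JSON"):
    """Return the PUG REST URL that looks up a compound by name (hypothetical helper)."""
    # Percent-encode the name so spaces and other special characters are URL-safe
    encoded_name = quote(name, safe="")
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/{output}"

print(build_name_url("glucose"))      # URL for a single-word name
print(build_name_url("acetic acid"))  # the space is encoded as %20
```

No request is sent here; the resulting string can be pasted into a browser or passed to `requests.get`.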

### Questions

How can we search using a compound's name?\
How can we retrieve multiple attributes about each compound at once?\
How can we use a PUG REST API call to get RDKit Mol objects?

### Learning Objectives

Search by name for one compound\
Search by molecular weight for multiple compounds\
Convert SMILES to Mol objects in bulk

### Purpose

This notebook demonstrates the most useful endpoints of the PUG REST API and shows how to use the API to build large sets of Mol objects from a PubChem search programmatically.

## Libraries

The following libraries need to be installed and imported to complete the tasks in this notebook.

| Library | Contents | Source |
| :-----: | :------- | :----- |
| requests | library for sending HTTP/1.1 requests | [PyPI page on requests](https://pypi.org/project/requests/) |
| rdkit | library of cheminformatics and machine-learning modules | [rdkit documentation](https://www.rdkit.org/) |

## Installation

These libraries will need to be installed in your computing environment to perform the tasks in this notebook.

To install from the command line on your computer, use this command (with the `requests` library as the example):

`pip install requests`

To install from within a Jupyter notebook or Colab notebook, type the same command in a coding cell, preceded by an exclamation point.

`!pip install requests`

These libraries will be imported as they are needed over the course of this notebook.


## Notebook Contents

The next section of the notebook includes all of the raw code for this example. **Experienced coders** should use this as they see fit, either in this notebook or in their preferred environment.

For **novice and intermediate coders**, the code is divided into sequential coding cells that each perform one step in the process. This notebook includes the following steps:

1. Imports
2. Search by name
3. Search by molecular weight range
4. Retrieve specified property
5. Retrieve SMILES
6. Retrieve multiple properties
7. Convert SMILES into RDKit Mol objects

In [None]:
# Full block of raw code for EXPERIENCED CODERS
import requests # Python requests package
from rdkit import Chem # rdkit Chem module
from rdkit.Chem import Draw # load the Draw submodule so that Chem.Draw is available for drawing

name = "glucose"
request = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/JSON") # access the PUG REST route
response = request.json() # parse the JSON data retrieved
print(len(response["PC_Compounds"])) # print the number of compounds retrieved
response # display the data retrieved

weight_minimum = 400.0
weight_maximum = 400.05
target_value = "cids"
request = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/molecular_weight/range/{weight_minimum}/{weight_maximum}/{target_value}/JSON") # access the PUG REST route
response = request.json() # parse the JSON data retrieved
print(len(response["IdentifierList"]["CID"])) # print the number of compounds retrieved
response # display the data retrieved

weight_minimum = 400.0
weight_maximum = 400.05
target_value = "property/InChIKey"
request = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/molecular_weight/range/{weight_minimum}/{weight_maximum}/{target_value}/JSON") # access the PUG REST route
response = request.json() # parse the JSON data retrieved
print(len(response["PropertyTable"]["Properties"])) # print the number of compounds retrieved
response # display the data retrieved

weight_minimum = 400.0
weight_maximum = 400.05
target_value = "property/SMILES"
request = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/molecular_weight/range/{weight_minimum}/{weight_maximum}/{target_value}/JSON") # access the PUG REST route
response = request.json() # parse the JSON data retrieved
print(len(response["PropertyTable"]["Properties"])) # print the number of compounds retrieved
response # display the data retrieved

weight_minimum = 400.0
weight_maximum = 400.05
target_value = "property/SMILES,MolecularWeight,MolecularFormula"
request = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/molecular_weight/range/{weight_minimum}/{weight_maximum}/{target_value}/JSON") # access the PUG REST route
response = request.json() # parse the JSON data retrieved
print(len(response["PropertyTable"]["Properties"])) # print the number of compounds retrieved
response # display the data retrieved

def generateMolsFromRequest(property_table): # define the function, taking the property table data as an input
    mols = []
    properties = property_table["Properties"] # pull out the list of retrieved property values
    for compound in properties: # loop through each compound's properties
        smiles = compound["SMILES"] # extract the SMILES string value
        mol = Chem.MolFromSmiles(smiles) # convert the SMILES to a Mol object using RDKit (None if parsing fails)
        mols.append(mol) # add the Mol object to a list
    return mols # return the list of converted Mol objects

mols = generateMolsFromRequest(response["PropertyTable"]) # call the function using the previously retrieved data, and assign the returned list to a variable
print(mols[0:5]) # print the first 5 Mols in the list
Chem.Draw.MolsToGridImage(mols[0:5]) # draw the first 5 Mols in the list

### Imports


In [None]:
import requests # Python requests package
from rdkit import Chem # rdkit Chem module
from rdkit.Chem import Draw # load the Draw submodule so that Chem.Draw is available for drawing

### Search by name
This API route retrieves all of the compound information for the specified name and returns the data in JSON format.

In [None]:
name = "glucose"
request = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/JSON") # access the PUG REST route
response = request.json() # parse the JSON data retrieved
print(len(response["PC_Compounds"])) # print the number of compounds retrieved
response # display the data retrieved
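
The by-name response is deeply nested, so it can help to pull out just the compound identifiers. The sketch below is one way to do that, run on a small hand-written sample rather than live data. The function name `extract_cids` is hypothetical, the sample is trimmed to the fields used here, and the check for a `"Fault"` key reflects an assumption about how the service reports a name it cannot resolve.

```python
def extract_cids(response):
    """Return the CID of every record in a parsed by-name response (hypothetical helper)."""
    # Assumption: a failed lookup comes back as a dictionary with a "Fault" key
    if "Fault" in response:
        message = response["Fault"].get("Message", "unknown error")
        raise ValueError(f"PUG REST reported a fault: {message}")
    # Each record stores its identifier under id -> id -> cid
    return [record["id"]["id"]["cid"] for record in response["PC_Compounds"]]

# Hand-written sample, trimmed to the fields this sketch reads
sample_response = {"PC_Compounds": [{"id": {"id": {"cid": 5793}}, "props": []}]}
# Hand-written sample of a failed lookup (shape is an assumption)
sample_fault = {"Fault": {"Code": "PUGREST.NotFound", "Message": "No CID found"}}

print(extract_cids(sample_response))
try:
    extract_cids(sample_fault)
except ValueError as error:
    print(error)
```

With live data, the same function can be called on the `response` dictionary produced by the cell above.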

### Search by molecular weight range
This API route retrieves the CIDs of all compounds whose molecular weight falls within the given range and returns the data in JSON format.

In [None]:
weight_minimum = 400.0
weight_maximum = 400.05
target_value = "cids"
request = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/molecular_weight/range/{weight_minimum}/{weight_maximum}/{target_value}/JSON") # access the PUG REST route
response = request.json() # parse the JSON data retrieved
print(len(response["IdentifierList"]["CID"])) # print the number of compounds retrieved
response # display the data retrieved
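
Because the remaining steps reuse the same route with a different final segment, the URL construction can be wrapped in a small function. This is only a sketch: `build_weight_range_url` is a hypothetical helper, and the check that the minimum does not exceed the maximum is an added safeguard, not something the route itself is known to require.

```python
def build_weight_range_url(weight_minimum, weight_maximum, target_value="cids", output="JSON"):
    """Return the molecular-weight range URL used in this notebook (hypothetical helper)."""
    # Guard against an inverted range before any request is sent
    if weight_minimum > weight_maximum:
        raise ValueError("weight_minimum must not exceed weight_maximum")
    return (
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/molecular_weight/range/"
        f"{weight_minimum}/{weight_maximum}/{target_value}/{output}"
    )

# Same values as the cell above; only the last argument changes between steps
print(build_weight_range_url(400.0, 400.05))
print(build_weight_range_url(400.0, 400.05, "property/SMILES"))
```

The returned string can be passed directly to `requests.get`.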

### Retrieve specified property
This API route retrieves the InChIKeys of all compounds whose molecular weight falls within the given range and returns the data in JSON format.

In [None]:
weight_minimum = 400.0
weight_maximum = 400.05
target_value = "property/InChIKey"
request = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/molecular_weight/range/{weight_minimum}/{weight_maximum}/{target_value}/JSON") # access the PUG REST route
response = request.json() # parse the JSON data retrieved
print(len(response["PropertyTable"]["Properties"])) # print the number of compounds retrieved
response # display the data retrieved

### Retrieve SMILES
This API route retrieves the SMILES strings of all compounds whose molecular weight falls within the given range and returns the data in JSON format.

In [None]:
weight_minimum = 400.0
weight_maximum = 400.05
target_value = "property/SMILES"
request = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/molecular_weight/range/{weight_minimum}/{weight_maximum}/{target_value}/JSON") # access the PUG REST route
response = request.json() # parse the JSON data retrieved
print(len(response["PropertyTable"]["Properties"])) # print the number of compounds retrieved
response # display the data retrieved

### Retrieve multiple properties
This API route retrieves the SMILES strings, molecular weights, and molecular formulas of all compounds whose molecular weight falls within the given range and returns the data in JSON format.

In [None]:
weight_minimum = 400.0
weight_maximum = 400.05
target_value = "property/SMILES,MolecularWeight,MolecularFormula"
request = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/molecular_weight/range/{weight_minimum}/{weight_maximum}/{target_value}/JSON") # access the PUG REST route
response = request.json() # parse the JSON data retrieved
print(len(response["PropertyTable"]["Properties"])) # print the number of compounds retrieved
response # display the data retrieved
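
When several properties are requested at once, each entry in the property table is a small dictionary, and it is often convenient to index those entries by CID. The following sketch shows one possible way to do so on a hand-written sample. The function name `properties_by_cid` is hypothetical, and the sample compounds only illustrate the shape of the data; they are not results of the weight-range search. The conversion with `float` assumes the molecular weight may arrive as a string.

```python
def properties_by_cid(property_table):
    """Map each CID to a dictionary of its other properties (hypothetical helper)."""
    lookup = {}
    for entry in property_table["Properties"]:
        # Copy every property except the CID, which becomes the dictionary key
        row = {key: value for key, value in entry.items() if key != "CID"}
        # Assumption: the weight may be a string, so convert it to a number
        if "MolecularWeight" in row:
            row["MolecularWeight"] = float(row["MolecularWeight"])
        lookup[entry["CID"]] = row
    return lookup

# Hand-written sample in the same shape as response["PropertyTable"]
sample_table = {
    "Properties": [
        {"CID": 702, "MolecularFormula": "C2H6O", "MolecularWeight": "46.07", "SMILES": "CCO"},
        {"CID": 962, "MolecularFormula": "H2O", "MolecularWeight": "18.015", "SMILES": "O"},
    ]
}

lookup = properties_by_cid(sample_table)
print(lookup[702]["SMILES"])
print(lookup[962]["MolecularWeight"])
```

With live data, `properties_by_cid(response["PropertyTable"])` would build the same kind of lookup from the cell above.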

### Convert SMILES into RdKit Mol objects
This Python function converts every SMILES string in our JSON data into an RDKit Mol object.

In [None]:
def generateMolsFromRequest(property_table): # define the function, taking the property table data as an input
    mols = []
    properties = property_table["Properties"] # pull out the list of retrieved property values
    for compound in properties: # loop through each compound's properties
        smiles = compound["SMILES"] # extract the SMILES string value
        mol = Chem.MolFromSmiles(smiles) # convert the SMILES to a Mol object using RDKit (None if parsing fails)
        mols.append(mol) # add the Mol object to a list
    return mols # return the list of converted Mol objects

mols = generateMolsFromRequest(response["PropertyTable"]) # call the function using the previously retrieved data, and assign the returned list to a variable
print(mols[0:5]) # print the first 5 Mols in the list
Chem.Draw.MolsToGridImage(mols[0:5]) # draw the first 5 Mols in the list
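
`Chem.MolFromSmiles` returns `None` when it cannot parse a SMILES string, so a bulk conversion may contain gaps. The sketch below shows one way to separate successful conversions from failures. The function `split_conversions` is a hypothetical helper, and `stand_in_parser` is a placeholder that lets the example run without RDKit; it is not a real SMILES parser.

```python
def split_conversions(smiles_list, parser):
    """Apply parser to each SMILES string and separate successes from failures."""
    converted, failed = [], []
    for smiles in smiles_list:
        result = parser(smiles)
        if result is None:
            failed.append(smiles)     # keep the input so it can be inspected later
        else:
            converted.append(result)  # keep the successfully converted object
    return converted, failed

def stand_in_parser(smiles):
    """Placeholder for Chem.MolFromSmiles: rejects empty strings only."""
    return smiles.upper() if smiles else None

converted, failed = split_conversions(["CCO", "", "O"], stand_in_parser)
print(len(converted), failed)
```

With RDKit installed, passing `Chem.MolFromSmiles` as the `parser` argument applies the same bookkeeping to real Mol objects.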