[PubChem](https://pubchem.ncbi.nlm.nih.gov/) is a database of chemical molecules and their properties, structures, and bioassay activities. It is maintained by the National Center for Biotechnology Information (NCBI). The information in the database can be retrieved through the [PUG REST API](http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest).

Each molecule in the database has a unique ID known as the Compound ID (CID). Each CID is associated with the compound's SMILES string, molecular properties, IUPAC name, and so on.

Thankfully, Python has a wrapper around the PUG REST API called [PubChemPy](https://pubchempy.readthedocs.io/en/latest/guide/introduction.html), which takes care of the details involved in retrieving data from a REST API.
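To give a feel for what the wrapper hides, here is a minimal sketch of how a PUG REST property request URL is put together. The helper name `build_property_url` and its parameters are hypothetical, and the sketch only builds the URL; it does not send the request.

```python
from urllib.parse import quote

# Base URL of the PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(identifier, namespace="name",
                       properties=("MolecularFormula",), output="JSON"):
    """Hypothetical helper: assemble a URL of the form
    <base>/compound/<namespace>/<identifier>/property/<properties>/<output>.
    """
    props = ",".join(properties)
    # Percent-encode the identifier so names with spaces or slashes are safe.
    safe_identifier = quote(str(identifier), safe="")
    return "{}/compound/{}/{}/property/{}/{}".format(
        PUG_REST_BASE, namespace, safe_identifier, props, output)


url = build_property_url("Glucose", properties=("MolecularFormula", "InChIKey"))
print(url)
```

Fetching that URL with any HTTP client should return the requested properties as JSON. PubChemPy builds requests like this, parses the responses, and wraps the results in Python objects.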

In [1]:
# Import library
import pubchempy as pcp

### Retrieving cid

PubChemPy allows retrieval of compounds by name, SMILES, SDF, InChI, InChIKey, or formula.

In [2]:
results = pcp.get_compounds('Glucose', 'name')
results

[Compound(5793)]
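The same call accepts the other namespaces by changing the second argument. The sketch below wraps it in a small helper that rejects unknown namespaces before any request is made. The name `find_compounds` and the `NAMESPACES` set are illustrative and not part of PubChemPy, and the set covers only the namespaces mentioned in this section.

```python
# Namespaces mentioned in this section for compound lookups.
NAMESPACES = {"name", "smiles", "sdf", "inchi", "inchikey", "formula"}


def find_compounds(identifier, namespace="name"):
    """Hypothetical wrapper: validate the namespace, then delegate to PubChemPy."""
    if namespace not in NAMESPACES:
        raise ValueError("unsupported namespace: {!r}".format(namespace))
    # Imported lazily so the validation above works without the library;
    # the call below needs network access.
    import pubchempy as pcp
    return pcp.get_compounds(identifier, namespace)
```

For example, `find_compounds('WQZGKKKJIJFFOK-GASJEMHNSA-N', 'inchikey')` or `find_compounds('C(C1C(C(C(C(O1)O)O)O)O)O', 'smiles')` should look up glucose by the identifiers shown later in this notebook, while `find_compounds('Glucose', 'cas')` fails immediately with a `ValueError`.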

### Retrieving data from cid

In [None]:
# Retrieving a compound from a cid
c = pcp.Compound.from_cid(5793)

# Print identifiers and a few computed properties
print("Molecular formula: ", c.molecular_formula)
print("Canonical SMILES: ", c.canonical_smiles)
print("INCHI: ", c.inchi)
print("INCHI Key: ", c.inchikey)
print("IUPAC Name: ", c.iupac_name)
print("\nSome properties:")
print("Atom stereo count = ", c.atom_stereo_count)
print("Bond stereo count = ", c.bond_stereo_count)
print("Charge = ", c.charge)
print("Exact Mass = ", c.exact_mass)

Molecular formula:  C6H12O6
Canonical SMILES:  C(C1C(C(C(C(O1)O)O)O)O)O
INCHI:  InChI=1S/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6?/m1/s1
INCHI Key:  WQZGKKKJIJFFOK-GASJEMHNSA-N
IUPAC Name:  (3R,4S,5S,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol

Some properties:
Atom stereo count =  5
Bond stereo count =  0
Charge =  0
Exact Mass =  180.063


### Pandas integration

In [None]:
df1 = pcp.get_compounds('C20H41Br', 'formula', as_dataframe=True)

In [None]:
df1.head()

Many more features are available, including retrieving 3D properties and similarity searching. See the documentation: https://pubchempy.readthedocs.io/en/latest/guide/gettingstarted.html

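### Can I retrieve data from a CAS number?

The approach below scrapes the NCBI PCCompound search page. It is not 100% reliable, because it depends on the HTML layout of that page, but it gets the job done.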

In [None]:
import requests
from bs4 import BeautifulSoup

def extract_id_type(soup):
    """ Takes a BS4 object of a PubChem page as input and returns the
    PubChem ID type and PubChem ID value as a 'type:value' string. """

    # Both values are stored in <meta> tags in the page header.
    pubchem_uid_type = soup.find_all(
        'meta', {'name': 'pubchem_uid_type'})[0]['content']
    pubchem_uid_value = soup.find_all(
        'meta', {'name': 'pubchem_uid_value'})[0]['content']

    return pubchem_uid_type + ':' + pubchem_uid_value


def get_pubchem(cas):
    """ Extract the mappings to pubchem ids of a given CAS number """

    # Get the search page.
    url = 'https://www.ncbi.nlm.nih.gov/pccompound?term="{}"'.format(cas)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')

    # In case the search page redirects, extract pubchem id type and value.
    try:
        return extract_id_type(soup)

    # Otherwise, get all the returned links and extract pubchem id type and value.
    except IndexError:

        # If only exact results are returned.
        if 'Quoted phrase not found' not in r.text and 'Did you mean: ' not in r.text:
            # Each search hit is a <div class="rprt"> with a <p class="title"> link.
            search_results = [pr.find_all('p', {"class": "title"})[0]
                              for pr in soup.find_all('div', {"class": "rprt"})]
            links = [sr.find_all('a')[0]['href'] for sr in search_results]
            pubchem_ids = list()
            for link in links:
                r = requests.get(link)
                soup = BeautifulSoup(r.text, "lxml")
                pubchem_ids.append(extract_id_type(soup))
            return pubchem_ids
        # No results found.
        else:
            return []

In [None]:
get_pubchem('50-99-7')
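As a possible alternative to scraping, here is a rough sketch that first checks the CAS number's check digit and then passes the number to PubChemPy's name search. This rests on the assumption that PubChem lists the CAS number among the compound's synonyms, which is common but not guaranteed. The helper names `is_valid_cas` and `cas_to_cids` are hypothetical.

```python
import re

# A CAS number has 2-7 digits, then 2 digits, then 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """Return True if `cas` is well formed and its check digit matches."""
    match = CAS_PATTERN.match(cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... from right to left; the sum modulo 10
    # must equal the final check digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def cas_to_cids(cas):
    """Hypothetical helper: search PubChem for a CAS number as if it were a name."""
    if not is_valid_cas(cas):
        raise ValueError("not a valid CAS number: {!r}".format(cas))
    # Imported lazily; the call below needs network access.
    import pubchempy as pcp
    return pcp.get_cids(cas, 'name')
```

For `'50-99-7'` the weighted sum is 9·1 + 9·2 + 0·3 + 5·4 = 47, and 47 mod 10 = 7, so the number passes the check. Rejecting malformed numbers up front avoids wasted requests, but whether `cas_to_cids` finds a match still depends on how the compound is indexed in PubChem.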