# Prototype queries: Compounds of [element], inorganic

This notebook tries a few different ways of querying for inorganic compounds of a given element.

This is an experiment to determine what kind of query will give the most meaningful results. We are interested in:

- How many compounds are returned?
- What kinds of compounds? Do they match our idea of the definition of the group?

Therefore, for the purposes of this experiment, we retrieve only CIDs (PubChem compound identifiers) and output HTML summaries with graphics from PubChem.


## Setup

In [None]:
import os
import sys
# import pandas as pd
# from pandas import DataFrame

import rdkit
from rdkit import Chem

import sqlalchemy
from sqlalchemy import create_engine, Table, MetaData
from sqlalchemy.sql import select, text, and_, or_, not_

sys.path.append('../..')
from camelid.env import CamelidEnv
from camelid.cmgroup import CMGroup, collect_to_json
from camelid.query import get_query_results, substructure_query, substruct_exclude_query
from camelid.hypertext import cids_to_html, directory

In [None]:
env = CamelidEnv('test')  # For output file management

# Database connection & metadata
eng = create_engine('postgresql://akokai@localhost/chmdata')
con = eng.connect()
meta = MetaData(con)

# Remember the cpds table and its molecule column,
# to help keep query-generating code concise:
cpds = Table('cpds', meta, autoload=True)
mol = cpds.c.molecule

### Set of elements of interest

These are all the elements having an "[X] compounds, inorganic" group among our current CMGs.

In [None]:
elems_inorg = [
    'As',
    'Cd',
    'Pb',
    'Sb',
    'Ni',
    'Au',
    'Be',
    'Rh',
    'Se',
    'Sn',
    'V',
    'U',
]

### List to store the resulting CMGs

In [None]:
cmgs = []

### Define function to get CIDs out of query result

In [None]:
def result_cids(df):
    """Return the non-null values of the query result's 'cid' column."""
    return df['cid'].dropna()

## SMARTS substructure [element], with SQL clause excluding [organics]

The question is **how to specify what "organic" patterns to exclude.**

### Try a number of different exclude patterns

In [None]:
exclude_patterns = {
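    'three_c': '[C,c].[C,c].[C,c]',  # three or more carbon atoms
    'two_c': '[C,c].[C,c]',          # two or more carbon atoms
    'ch_bonds': '[C!H0,c!H0]',       # any carbon bearing at least one hydrogen
    'carbon': '[C,c]'                # any carbon atom at all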
}
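To make the include/exclude idea concrete, here is a minimal sketch of the kind of SQL such a query could correspond to. It assumes the `cpds` table is searched with the RDKit PostgreSQL cartridge (its `@>` substructure operator and `qmol` type), which this notebook does not confirm. `sketch_exclude_sql` is a hypothetical helper written for illustration only; it is not part of camelid and may differ from what `substruct_exclude_query` actually generates.

```python
def sketch_exclude_sql(include_smarts, exclude_smarts,
                       table="cpds", mol_col="molecule"):
    """Build illustrative SQL text: match one SMARTS pattern, reject another.

    Hypothetical helper, not part of camelid. Assumes the RDKit PostgreSQL
    cartridge, where `mol @> 'pattern'::qmol` is a substructure test.
    Patterns are interpolated directly into the string, so this is only
    suitable for trusted, hard-coded SMARTS.
    """
    return (
        "SELECT cid FROM {t} "
        "WHERE {m} @> '{inc}'::qmol "
        "AND NOT ({m} @> '{exc}'::qmol)"
    ).format(t=table, m=mol_col, inc=include_smarts, exc=exclude_smarts)


# Lead compounds that do not contain two or more carbon atoms.
sql = sketch_exclude_sql("[Pb]", "[C,c].[C,c]")
print(sql)
```

The stricter the exclude pattern (from `three_c` down to `carbon`), the fewer compounds survive the `AND NOT` clause, which is what the comparison in this section is meant to show.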

### Execute SQL queries

In [None]:
for pat in exclude_patterns.keys():

    for elem in elems_inorg:

        # Set up parameters for a CMG:
        elem_smarts = '[{}]'.format(elem)
        id_ = elem + '_{}'.format(pat)

        params = {  # a minimal set
            'cmg_id': id_,
            'structure_type': 'SMARTS',
            'name': '{0} compounds, inorganic (excluding {1})'.format(elem, pat)
        }

        cmg = CMGroup(params, env)

        # Generate the query, save it as text, and execute it
        que = substruct_exclude_query(elem_smarts, exclude_patterns[pat], mol, [cpds.c.cid])
        sql_txt = str(que.compile(compile_kwargs={'literal_binds': True}))
        result = get_query_results(que, con)

        # Get the CIDs
        cids = result_cids(result)

        # Add summary of results to CMG
        summ = {'sql': sql_txt, '# results': len(result), '# cids': len(cids)}
        cmg.add_info(summ)

        # Output HTML page for results
        html_file = '{}.html'.format(os.path.join(cmg.results_path, cmg.cmg_id))
        cids_to_html(cids, html_file, title=cmg.name, info=cmg.info)

        # Save the CMG
        cmgs.append(cmg)
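As a quick sanity check on the loop above, the sketch below enumerates the groups it creates without touching the database. The element list and pattern dictionary are copied from the earlier cells so that it runs on its own; `planned_groups` is a hypothetical helper name, not part of camelid.

```python
from itertools import product

# Copied from the cells above so this sketch is self-contained.
elems_inorg = ['As', 'Cd', 'Pb', 'Sb', 'Ni', 'Au',
               'Be', 'Rh', 'Se', 'Sn', 'V', 'U']
exclude_patterns = {
    'three_c': '[C,c].[C,c].[C,c]',
    'two_c': '[C,c].[C,c]',
    'ch_bonds': '[C!H0,c!H0]',
    'carbon': '[C,c]',
}


def planned_groups(elems, patterns):
    """Return (cmg_id, name) for every pattern/element pair, in loop order."""
    return [
        ('{}_{}'.format(elem, pat),
         '{0} compounds, inorganic (excluding {1})'.format(elem, pat))
        for pat, elem in product(patterns, elems)  # pattern is the outer loop
    ]


groups = planned_groups(elems_inorg, exclude_patterns)
print(len(groups))   # one group per (pattern, element) pair
print(groups[0])
```

With 4 exclude patterns and 12 elements this gives 48 groups, each with a unique ID, so each one gets its own HTML results page.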

## Single-clause SMARTS substructure

I don't know how to write a single SMARTS pattern that means "contains this element and no carbon *anywhere*".

Instead, this section **experiments with identifying "inorganic" forms of carbon (carbonate, CO, CN...):**

In [None]:
smarts_strings = {
    'inorg_c': '[{0};!$([{0}]-[C,c])].[CH0;!$(C~C[H])]'
}
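To see what this template turns into for one element, the sketch below fills it in for lead and splits the result at the `.` separator. The variable names are illustrative only, and the annotations are a reading of standard SMARTS syntax rather than a statement of what the query is guaranteed to retrieve.

```python
template = '[{0};!$([{0}]-[C,c])].[CH0;!$(C~C[H])]'

# Both {0} placeholders receive the same element symbol.
pb_smarts = template.format('Pb')
print(pb_smarts)

# The '.' separates two atom queries that need not be bonded to each other.
first, second = pb_smarts.split('.')

# first:  a Pb atom that is not single-bonded to any carbon
print(first)
# second: an aliphatic carbon with no attached hydrogens, further
#         restricted by the recursive exclusion !$(C~C[H])
print(second)
```

Because the second atom query must also match, this pattern only finds compounds that contain at least one such hydrogen-free carbon alongside the element; compounds with no carbon at all are not matched by it.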

In [None]:
for pat in smarts_strings.keys():

    for elem in elems_inorg:

        # Set up parameters for a CMG:
        elem_smarts = smarts_strings[pat].format(elem)
        id_ = elem + '_{}'.format(pat)

        params = {  # a minimal set
            'cmg_id': id_,
            'structure_type': 'SMARTS',
            'name': '{0} compounds, inorganic carbon ({1})'.format(elem, pat)
        }

        cmg = CMGroup(params, env)

        # Generate the query, save it as text, and execute it
        que = substructure_query(elem_smarts, mol, [cpds.c.cid])
        sql_txt = str(que.compile(compile_kwargs={'literal_binds': True}))
        result = get_query_results(que, con)

        # Get the CIDs
        cids = result_cids(result)

        # Add summary of results to CMG
        summ = {'sql': sql_txt, '# results': len(result), '# cids': len(cids)}
        cmg.add_info(summ)

        # Output HTML page for results
        html_file = '{}.html'.format(os.path.join(cmg.results_path, cmg.cmg_id))
        cids_to_html(cids, html_file, title=cmg.name, info=cmg.info)

        # Save the CMG
        cmgs.append(cmg)

## Dump all CMG objects to JSON

In [None]:
collect_to_json(cmgs, os.path.join(env.results_path, 'inorganic.json'))

## Create HTML directory of all CMGs

In [None]:
directory(cmgs, env)