# Chemical database initialization

## Goal

Bootstrap a chemical database with ~750,000 structures from the US EPA CompTox Dashboard's public dataset.

Set up the database so that it can be used for substructure searching via the [RDKit PostgreSQL database cartridge](http://www.rdkit.org/docs/Cartridge.html).


## Data source

`dsstox_20160701.tsv` - *Mapping file of InChIStrings, InChIKeys and DTXSIDs for the EPA CompTox Dashboard,* [available on Figshare](https://figshare.com/articles/Mapping_file_of_InChIStrings_InChIKeys_and_DTXSIDs_for_the_EPA_CompTox_Dashboard/3578313), published 12.08.2016 by Antony Williams. 
- Date: 2016-07-01
- License: CC0

## Notes on software dependencies

Requires:
- a running instance of PostgreSQL with the RDKit cartridge installed;
- Python packages and dependencies: rdkit, sqlalchemy, psycopg2, pandas.

In [1]:
import pandas as pd
from pandas import DataFrame, Series
from rdkit import Chem
from rdkit.Chem import AllChem  #, Draw
# SQLAlchemy also needs psycopg2 installed
from sqlalchemy import create_engine, types, Table, Column, MetaData
from sqlalchemy.sql import select, text

EPA_DATA = '/opt/akokai/data/EPA_iCSS/dsstox_20160701.tsv'

# Generate molecules from structural IDs

- Read in the list of EPA DSSTox substance IDs.
    - Use all-lowercase column names to avoid SQL mix-ups.
- Convert each supplied InChI into an RDKit `Mol` object.
- Convert each `Mol` into its binary representation.
    - The RDKit PostgreSQL database cartridge can convert binary strings back to molecules in the database.
    - There is no `mol_from_inchi` method in the PGSQL RDKit extension.

This code will need to be rewritten to properly handle all 750K rows, instead of just the first 1K.
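
One possible way to scale the conversion to the full file is to stream it in fixed-size batches, so that only one batch of molecules is held in memory at a time. The following is a minimal sketch, not the code used in this notebook: `iter_converted_batches`, `inchi_to_binary`, and the default batch size are hypothetical, and the file is read with the standard `csv` module rather than pandas. It assumes the same header-less, three-column TSV layout.

```python
import csv
from itertools import islice


def inchi_to_binary(inchi):
    """Convert an InChI string to RDKit's binary Mol format, or None on failure."""
    # Imported lazily so the batching logic itself has no hard RDKit dependency.
    from rdkit import Chem
    mol = Chem.MolFromInchi(inchi)
    return mol.ToBinary() if mol is not None else None


def iter_converted_batches(path, convert=inchi_to_binary, batch_size=10000):
    """Yield lists of row dicts (dtxsid, inchi, inchikey, bin), one batch at a time.

    Malformed rows and rows whose InChI cannot be converted are skipped.
    Batches that end up empty are not yielded.
    """
    with open(path, newline='') as handle:
        reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
        while True:
            rows = list(islice(reader, batch_size))
            if not rows:
                break
            batch = []
            for row in rows:
                if len(row) != 3:
                    continue  # skip lines that do not have exactly three fields
                dtxsid, inchi, inchikey = row
                binary = convert(inchi)
                if binary is not None:
                    batch.append({'dtxsid': dtxsid, 'inchi': inchi,
                                  'inchikey': inchikey, 'bin': binary})
            if batch:
                yield batch
```

Each batch could then be wrapped in a `DataFrame` and appended to the table with `to_sql(..., if_exists='append')`. pandas users may prefer `pd.read_csv(..., chunksize=...)`, which follows the same pattern.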

In [10]:
dtx = pd.read_table(EPA_DATA, names=['dtxsid', 'inchi', 'inchikey'], nrows=1000)
dtx['mols'] = dtx.inchi.apply(Chem.MolFromInchi)
dtx.dropna(inplace=True)
print(len(dtx), 'molecules')
dtx['bin'] = dtx.mols.apply(lambda m: m.ToBinary())
dtx.drop('mols', axis=1, inplace=True)
dtx.head()

999 molecules


| | dtxsid | inchi | inchikey | bin |
|---|---|---|---|---|
| 0 | DTXSID7020001 | InChI=1S/C11H9N3/c12-10-6-5-8-7-3-1-2-4-9(7)13... | FJTNLJLPLJDTRM-UHFFFAOYSA-N | b'\xef\xbe\xad\xde\x00\x00\x00\x00\x07\x00\x00... |
| 1 | DTXSID5039224 | InChI=1S/C2H4O/c1-2-3/h2H,1H3 | IKHGUXGNUITLKF-UHFFFAOYSA-N | b'\xef\xbe\xad\xde\x00\x00\x00\x00\x07\x00\x00... |
| 2 | DTXSID50872971 | InChI=1S/C4H8N2O/c1-3-5-6(2)4-7/h3-4H,1-2H3/b5-3+ | IMAGWKUTFZRWSB-HWKANZROSA-N | b'\xef\xbe\xad\xde\x00\x00\x00\x00\x07\x00\x00... |
| 3 | DTXSID2020004 | InChI=1S/C2H5NO/c1-2-3-4/h2,4H,1H3/b3-2+ | FZENGILVLUJGJX-NSCUHMNNSA-N | b'\xef\xbe\xad\xde\x00\x00\x00\x00\x07\x00\x00... |
| 4 | DTXSID7020005 | InChI=1S/C2H5NO/c1-2(3)4/h1H3,(H2,3,4) | DLFVBJFMPXGRIB-UHFFFAOYSA-N | b'\xef\xbe\xad\xde\x00\x00\x00\x00\x07\x00\x00... |


# Copy into new database table

In [3]:
conn = create_engine('postgresql://akokai@localhost/chmdata')

In [16]:
# To be able to re-run the code block below, first drop the table:
!psql chmdata -c 'drop table epa;'

DROP TABLE


In [17]:
# types.LargeBinary replaces the deprecated types.Binary alias.
dtypes = {'dtxsid': types.String,
          'inchi': types.String,
          'inchikey': types.String,
          'bin': types.LargeBinary}

dtx.to_sql('epa', conn, if_exists='fail', index=False, chunksize=65536, dtype=dtypes)

### Just make sure it's there...
Can delete these cells after table creation & reflection are known to work.

In [13]:
# Reflect this table in an SQLAlchemy object.
meta_epa = MetaData()
# autoload_with alone triggers reflection; the separate autoload flag is deprecated.
epa = Table('epa', meta_epa, autoload_with=conn)

In [14]:
[c.name for c in epa.columns]

['dtxsid', 'inchi', 'inchikey', 'bin']

# Generate `mol`-type column in the table

1. Either create a new table with `(dtxsid, inchi, ..., molecule)` *or* add column `molecule` to table `epa`.

2. Index the molecules using the GiST-powered RDKit extension:

```
create index molidx on epa using gist(molecule);
```

In [18]:
# Test to see if mol_from_pkl actually works... yes!
cmd = text(
    'select mol_from_pkl(bin) from epa limit 5;')
conn.execute(cmd).fetchall()

[('N=c1ccc2c([nH]1)[nH]c1ccccc12',),
 ('CC=O',),
 ('C/C=N/N(C)C=O',),
 ('C/C=N/O',),
 ('CC(=N)O',)]

### 'Alter table' method

This works, but may not be desirable because:
- the resulting table would have an unneeded column containing the binary strings, and
- if anything goes wrong, we'd need to regenerate the whole set of `Mol` objects and binary representations.

In [23]:
cmd = text(
    '''alter table epa add column molecule mol;
       update epa set molecule = mol_from_pkl(bin);''')
res = conn.execute(cmd)

In [27]:
# Test to see if that worked... yes!
cmd = text('select * from epa limit 5;')
conn.execute(cmd).fetchall()

[('DTXSID7020429', 'InChI=1S/C2Cl2/c3-1-2-4', 'ZMJOVJSTYLQINE-UHFFFAOYSA-N', <memory at 0x7f9627dc8948>, 'ClC#CCl'),
 ('DTXSID2020711', 'InChI=1S/ClH/h1H', 'VEXZGXHMUGYJMC-UHFFFAOYSA-N', <memory at 0x7f9627dc8b88>, 'Cl'),
 ('DTXSID4020874', 'InChI=1S/CH6N2/c1-3-2/h3H,2H2,1H3', 'HDZGCSFEDULWCS-UHFFFAOYSA-N', <memory at 0x7f9627dc8c48>, 'CNN'),
 ('DTXSID7020001', 'InChI=1S/C11H9N3/c12-10-6-5-8-7-3-1-2-4-9(7)13-11(8)14-10/h1-6H,(H3,12,13,14)', 'FJTNLJLPLJDTRM-UHFFFAOYSA-N', <memory at 0x7f9627dc8d08>, 'N=c1ccc2c([nH]1)[nH]c1ccccc12'),
 ('DTXSID5039224', 'InChI=1S/C2H4O/c1-2-3/h2H,1H3', 'IKHGUXGNUITLKF-UHFFFAOYSA-N', <memory at 0x7f9627dc8dc8>, 'CC=O')]
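
If the 'alter table' route were kept, the unneeded binary column could be dropped once the new column has been checked. The sketch below is one hedged way to do that with a plain DB-API cursor; `add_molecule_column` and its parameters are hypothetical, and it assumes that any failed conversion would show up as a NULL `molecule`, which has not been verified here.

```python
def add_molecule_column(cursor, table='epa', drop_binary=False):
    """Add a mol-typed column built from the pickled binaries.

    `cursor` is any DB-API cursor (for example from psycopg2). `table` is
    interpolated directly into the SQL, so it must be a trusted identifier.
    Returns the number of rows whose molecule is NULL after the update.
    """
    cursor.execute('alter table {} add column molecule mol'.format(table))
    cursor.execute('update {} set molecule = mol_from_pkl(bin)'.format(table))
    # Assumption: rows that failed to convert are left with a NULL molecule.
    cursor.execute('select count(*) from {} where molecule is null'.format(table))
    n_failed = cursor.fetchone()[0]
    # Only discard the source binaries when every row converted.
    if drop_binary and n_failed == 0:
        cursor.execute('alter table {} drop column bin'.format(table))
    return n_failed
```

With the engine defined earlier, a cursor could be obtained from `conn.raw_connection()`, and the changes committed on that raw connection afterwards.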

### 'Select into' method

This method creates a new table called `chem` that has only the desired columns.

In [22]:
cmd = text(
    '''select (dtxsid, inchi, inchikey, mol_from_pkl(bin))
       from epa
       limit 5;''')
conn.execute(cmd).fetchall()

[('(DTXSID7020001,"InChI=1S/C11H9N3/c12-10-6-5-8-7-3-1-2-4-9(7)13-11(8)14-10/h1-6H,(H3,12,13,14)",FJTNLJLPLJDTRM-UHFFFAOYSA-N,"N=c1ccc2c([nH]1)[nH]c1ccccc12")',),
 ('(DTXSID5039224,"InChI=1S/C2H4O/c1-2-3/h2H,1H3",IKHGUXGNUITLKF-UHFFFAOYSA-N,CC=O)',),
 ('(DTXSID50872971,"InChI=1S/C4H8N2O/c1-3-5-6(2)4-7/h3-4H,1-2H3/b5-3+",IMAGWKUTFZRWSB-HWKANZROSA-N,"C/C=N/N(C)C=O")',),
 ('(DTXSID2020004,"InChI=1S/C2H5NO/c1-2-3-4/h2,4H,1H3/b3-2+",FZENGILVLUJGJX-NSCUHMNNSA-N,C/C=N/O)',),
 ('(DTXSID7020005,"InChI=1S/C2H5NO/c1-2(3)4/h1H3,(H2,3,4)",DLFVBJFMPXGRIB-UHFFFAOYSA-N,"CC(=N)O")',)]

In [50]:
# To be able to re-run the code block below, first drop the table:
!psql chmdata -c 'drop table chem;'

DROP TABLE


In [52]:
cmd = text(
    '''create table chem
       as select dtxsid, inchi, inchikey, mol_from_pkl(bin) molecule from epa;''')
res = conn.execute(cmd)

In [53]:
# Test to see if that worked... YES!
cmd = text('select * from chem limit 5;')
conn.execute(cmd).fetchall()

[('DTXSID7020001', 'InChI=1S/C11H9N3/c12-10-6-5-8-7-3-1-2-4-9(7)13-11(8)14-10/h1-6H,(H3,12,13,14)', 'FJTNLJLPLJDTRM-UHFFFAOYSA-N', 'N=c1ccc2c([nH]1)[nH]c1ccccc12'),
 ('DTXSID5039224', 'InChI=1S/C2H4O/c1-2-3/h2H,1H3', 'IKHGUXGNUITLKF-UHFFFAOYSA-N', 'CC=O'),
 ('DTXSID50872971', 'InChI=1S/C4H8N2O/c1-3-5-6(2)4-7/h3-4H,1-2H3/b5-3+', 'IMAGWKUTFZRWSB-HWKANZROSA-N', 'C/C=N/N(C)C=O'),
 ('DTXSID2020004', 'InChI=1S/C2H5NO/c1-2-3-4/h2,4H,1H3/b3-2+', 'FZENGILVLUJGJX-NSCUHMNNSA-N', 'C/C=N/O'),
 ('DTXSID7020005', 'InChI=1S/C2H5NO/c1-2(3)4/h1H3,(H2,3,4)', 'DLFVBJFMPXGRIB-UHFFFAOYSA-N', 'CC(=N)O')]

In [54]:
# Check to see column names.
meta_chem = MetaData()
chem = Table('chem', meta_chem, autoload_with=conn)
[c.name for c in chem.columns]

['dtxsid', 'inchi', 'inchikey', 'molecule']

In [55]:
# Build the GiST index on the mol column; no error means it succeeded.
cmd = text('create index molidx on chem using gist(molecule);')
res = conn.execute(cmd)
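
With the index in place, the table should be usable for substructure searching, which was the goal stated at the top. The following sketch only builds a parameterized query and is not part of the original notebook: `build_substructure_query` is a hypothetical helper, it relies on the cartridge's `@>` substructure operator, and it assumes the driver substitutes the parameter as a string literal (as psycopg2 does), so that `cast(... as mol)` parses it as SMILES.

```python
def build_substructure_query(smiles, table='chem', limit=10):
    """Build a parameterized substructure search against a mol-typed column.

    Returns (sql, params) suitable for sqlalchemy.sql.text(). `table` is
    interpolated directly into the SQL, so it must be a trusted identifier.
    """
    sql = ('select dtxsid, inchikey from {} '
           'where molecule @> cast(:query as mol) '
           'limit :limit'.format(table))
    return sql, {'query': smiles, 'limit': int(limit)}
```

For example, `sql, params = build_substructure_query('c1ccccc1')` followed by `conn.execute(text(sql), params).fetchall()` would look for structures containing a benzene ring. `cast(:query as mol)` is used instead of `:query::mol` because SQLAlchemy's `text()` does not treat a name followed by a colon as a bind parameter.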