# Chemical database initialization 1: Structures

## Goal

Bootstrap a chemical database with ~700,000 structures from the US EPA CompTox Dashboard's public dataset.

Set up the database so that it can be used for substructure searching via the [RDKit PostgreSQL database cartridge](http://www.rdkit.org/docs/Cartridge.html).


## Data source

`dsstox_20160701.tsv` - *Mapping file of InChIStrings, InChIKeys and DTXSIDs for the EPA CompTox Dashboard,* [available on Figshare](https://figshare.com/articles/Mapping_file_of_InChIStrings_InChIKeys_and_DTXSIDs_for_the_EPA_CompTox_Dashboard/3578313), published 12.08.2016 by Antony Williams. 
- Date: 2016-07-01
- License: CC0
- Columns: DTXSID (US EPA internal ID), InChI, InChIKey


## Notes on software dependencies

Requires:
- a running instance of PostgreSQL with the RDKit cartridge installed;
- Python packages and dependencies: rdkit, sqlalchemy, psycopg2, pandas.

In [1]:
import pandas as pd
import csv
from pandas import DataFrame, Series
from rdkit import Chem
from rdkit.Chem import AllChem
from sqlalchemy import create_engine, types
from sqlalchemy.sql import text

## Generate database table of structural representations

- Taking the list of EPA InChI(Key)s and DSSTox substance IDs, convert each InChI into an RDKit `Mol` object. Then convert each `Mol` into its binary representation.
- Create a big table in PostgreSQL that adds a binary molecule-object column to the original EPA dataset. In other words, a table with columns `(dtxsid, inchi, inchikey, bin)`.
  - This intermediate step is necessary because the RDKit PostgreSQL extension has no `mol_from_inchi` function, but it does have `mol_from_pkl`, which builds molecules from binary representations. Otherwise we could go straight from InChI to molecules in the SQL table.
  - The ~720K rows seem to be too much to process in memory all at once, so I am going through the file lazily in chunks.
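As a rough illustration of what "lazily in chunks" means, here is a minimal standard-library sketch. The helper name `iter_chunks` is hypothetical and not part of this notebook, which relies on the `chunksize` argument of pandas instead. The point it shows is that the final chunk is usually shorter than the requested size, which matters for any per-chunk bookkeeping.

```python
from itertools import islice


def iter_chunks(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(rows)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block


# 25 stand-in rows in chunks of 10: two full chunks, then a short one.
chunk_sizes = [len(block) for block in iter_chunks(range(25), 10)]
print(chunk_sizes)
```

Only one chunk is held in memory at a time, so peak memory use depends on the chunk size rather than on the length of the file.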

### Notes
- Use all-lowercase column names to avoid SQL mix-ups.
- RDKit fails to create a molecule from some InChIs (`Chem.MolFromInchi` returns `None` in those cases). The number of molecules we have in the end will therefore be somewhat lower than the 719,996 input rows.
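The conversion steps described above can be sketched on a single structure. This is a small illustration rather than the notebook's actual processing code: it takes the InChI for acetaldehyde from the dataset, converts it to a `Mol`, serializes it with `ToBinary()`, and rebuilds a `Mol` from the bytes, which is roughly what `mol_from_pkl` does on the database side. The unparseable string at the end is a made-up input used only to show the failure case.

```python
from rdkit import Chem

# Acetaldehyde, one of the structures in the EPA dataset.
inchi = 'InChI=1S/C2H4O/c1-2-3/h2H,1H3'

mol = Chem.MolFromInchi(inchi)   # InChI -> Mol
blob = mol.ToBinary()            # Mol -> bytes in RDKit's binary format
restored = Chem.Mol(blob)        # bytes -> Mol (Python-side analogue of mol_from_pkl)

print(Chem.MolToSmiles(restored))

# Made-up, unparseable input: RDKit logs an error and returns None
# instead of raising, which is why failed rows can be dropped as missing.
bad = Chem.MolFromInchi('not an InChI')
print(bad is None)
```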

In [2]:
DTX_DATA = '/opt/akokai/data/EPA/dsstox-20160701.tsv'

In [3]:
!echo "DSSTOX dataset length: $(wc -l /opt/akokai/data/EPA/dsstox-20160701.tsv)"

DSSTOX dataset length: 719996 /opt/akokai/data/EPA/dsstox-20160701.tsv


In [4]:
# To be able to re-run this code from scratch, first drop the table if it already exists:
!psql chmdata -c 'drop table if exists dtx;'

DROP TABLE


In [4]:
conn = create_engine('postgresql://akokai@localhost/chmdata')

This will take a while...

In [6]:
dtypes = {'dtxsid': types.String,
          'inchi': types.String,
          'inchikey': types.String,
          'bin': types.LargeBinary}  # types.Binary is a deprecated alias

ninput = 719996
ncreated = 0
chunk = 10000

dtx = pd.read_table(DTX_DATA, names=['dtxsid', 'inchi', 'inchikey'],
                    chunksize=chunk, low_memory=True)


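for df in dtx:
    nread = len(df)  # the final chunk is shorter than `chunk`
    # MolFromInchi returns None on failure; those rows are dropped below
    df['mol'] = df.inchi.apply(Chem.MolFromInchi)
    df.dropna(inplace=True)
    n = len(df)
    ncreated += n
    print('{0} molecules created, {1} errors'.format(n, nread - n))
    # Serialize each Mol to RDKit's binary format, for mol_from_pkl in SQL
    df['bin'] = df.mol.apply(lambda m: m.ToBinary())
    df.drop('mol', axis=1, inplace=True)
    df.to_sql('dtx', conn, if_exists='append', index=False, chunksize=65536, dtype=dtypes)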

print('Total: {0} molecules created, {1} errors'.format(ncreated, ninput - ncreated))

9996 molecules created, 4 errors
9985 molecules created, 15 errors
9994 molecules created, 6 errors
9995 molecules created, 5 errors
9995 molecules created, 5 errors
9993 molecules created, 7 errors
9997 molecules created, 3 errors
9989 molecules created, 11 errors
9996 molecules created, 4 errors
9993 molecules created, 7 errors
9989 molecules created, 11 errors
9993 molecules created, 7 errors
9992 molecules created, 8 errors
9988 molecules created, 12 errors
10000 molecules created, 0 errors
10000 molecules created, 0 errors
9998 molecules created, 2 errors
9996 molecules created, 4 errors
9994 molecules created, 6 errors
9993 molecules created, 7 errors
9998 molecules created, 2 errors
10000 molecules created, 0 errors
10000 molecules created, 0 errors
10000 molecules created, 0 errors
10000 molecules created, 0 errors
9999 molecules created, 1 errors
9998 molecules created, 2 errors
9996 molecules created, 4 errors
9990 molecules created, 10 errors
9998 molecules created, 2 errors

## Generate `mol`-type column and index structures

Create a new table with columns `(dtxsid, inchi, inchikey, molecule)` where the last column contains RDKit `mol`-type structures. Index the table on the structures using the GiST-powered RDKit extension. (This is what enables substructure searching in SQL.)

In [7]:
# To be able to re-run the code below, first drop the table if it already exists:
!psql chmdata -c 'drop table if exists chem;'

DROP TABLE


In [8]:
cmd = text(
    '''create table chem
       as select dtxsid, inchi, inchikey, mol_from_pkl(bin) molecule from dtx;''')
res = conn.execute(cmd)
print(res.rowcount)

719632


### Check results

In [9]:
assert res.rowcount == ncreated

In [5]:
# Check that the table contains expected data... 
cmd = text('select * from chem limit 5;')
conn.execute(cmd).fetchall()

[('DTXSID7020001', 'InChI=1S/C11H9N3/c12-10-6-5-8-7-3-1-2-4-9(7)13-11(8)14-10/h1-6H,(H3,12,13,14)', 'FJTNLJLPLJDTRM-UHFFFAOYSA-N', 'N=c1ccc2c([nH]1)[nH]c1ccccc12'),
 ('DTXSID5039224', 'InChI=1S/C2H4O/c1-2-3/h2H,1H3', 'IKHGUXGNUITLKF-UHFFFAOYSA-N', 'CC=O'),
 ('DTXSID50872971', 'InChI=1S/C4H8N2O/c1-3-5-6(2)4-7/h3-4H,1-2H3/b5-3+', 'IMAGWKUTFZRWSB-HWKANZROSA-N', 'C/C=N/N(C)C=O'),
 ('DTXSID2020004', 'InChI=1S/C2H5NO/c1-2-3-4/h2,4H,1H3/b3-2+', 'FZENGILVLUJGJX-NSCUHMNNSA-N', 'C/C=N/O'),
 ('DTXSID7020005', 'InChI=1S/C2H5NO/c1-2(3)4/h1H3,(H2,3,4)', 'DLFVBJFMPXGRIB-UHFFFAOYSA-N', 'CC(=N)O')]

### Create the index

It takes a while...

In [11]:
cmd = text('create index molidx on chem using gist(molecule);')
res = conn.execute(cmd)
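
With the index in place, a substructure search could look something like the sketch below. It is not part of the original notebook. It assumes a psycopg2-style DB-API connection (with `%s` placeholders) to the same database, and it uses the cartridge's `@>` operator to match structures in `chem.molecule` that contain the query. The function name `find_substructure` and the constant `SUBSTRUCTURE_SQL` are hypothetical.

```python
# Assumption: `dbapi_conn` is a DB-API connection such as the one returned
# by psycopg2.connect(...), pointing at the database that holds `chem`.
SUBSTRUCTURE_SQL = (
    'select dtxsid, inchikey from chem '
    'where molecule @> %s '   # @> is the cartridge's substructure operator
    'limit %s;'
)


def find_substructure(dbapi_conn, smiles, limit=10):
    """Return (dtxsid, inchikey) rows whose structure contains `smiles`."""
    cur = dbapi_conn.cursor()
    try:
        cur.execute(SUBSTRUCTURE_SQL, (smiles, limit))
        return cur.fetchall()
    finally:
        cur.close()
```

For example, `find_substructure(psycopg2.connect(dbname='chmdata'), 'c1ccccc1')` would be expected to return up to ten substances containing a benzene ring, assuming the database is reachable with those connection settings.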