# Chemical database initialization 2: CIDs

## Goal

Generate PubChem CIDs for the set of structures that were added to the database in the previous notebook. Create a table relating US EPA DTXSID, **CID**, InChIKey, InChI, and molecule objects.

## Approach
Look up a CID for each InChIKey using the [PubChem ID exchange](https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi). (Matching by full InChI string is also possible, but it appears to be much slower.) Select the 1:1 correspondences, then use SQL to do an inner join of `(inchikey, cid)` and `(inchikey, ..., molecule)`.

## Notes on software dependencies

Requires:
- a running instance of PostgreSQL with the RDKit cartridge installed, containing the database generated in the previous notebook;
- Python packages and dependencies: rdkit, sqlalchemy, psycopg2, pandas.

In [1]:
import pandas as pd
import csv
from pandas import DataFrame, Series
from rdkit import Chem
from rdkit.Chem import AllChem
from sqlalchemy import create_engine, types
from sqlalchemy.sql import text

## Find 1:1 mappings of InChIKeys to CIDs

### Prepare files to upload to PubChem ID exchange

Extract the **InChIKeys**, then split the resulting file into two files with fewer than 500K lines each (required by the ID exchange service).

In [2]:
!awk -F '\t' '{print $3}' /opt/akokai/data/EPA/dsstox-20160701.tsv > /opt/akokai/data/EPA/dsstox-inchikey.txt
!split -l 360000 /opt/akokai/data/EPA/dsstox-inchikey.txt /opt/akokai/data/EPA/dsstox-inchikey- --additional-suffix .txt
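
The same extraction and splitting can be done in Python, which may help where `awk` or GNU `split` (needed for `--additional-suffix`) is unavailable. This is a minimal sketch, assuming a tab-separated input with the InChIKey in the third column. The function name `split_column` and its numbered output file names are illustrative and differ from the `aa`, `ab` suffixes that `split` produces.

```python
from pathlib import Path


def split_column(tsv_path, out_prefix, column=2, max_lines=360000):
    """Write one column of a TSV file to numbered text files.

    Each output file holds at most `max_lines` lines. `column` is
    zero-based, so 2 corresponds to awk's $3. Rows that are too short
    yield an empty line, as awk would.
    """
    out_paths = []
    out = None
    try:
        with open(tsv_path) as src:
            for n, line in enumerate(src):
                if n % max_lines == 0:
                    # Start a new chunk file.
                    if out is not None:
                        out.close()
                    path = Path(f'{out_prefix}{len(out_paths) + 1}.txt')
                    out = open(path, 'w')
                    out_paths.append(path)
                fields = line.rstrip('\n').split('\t')
                value = fields[column] if len(fields) > column else ''
                out.write(value + '\n')
    finally:
        if out is not None:
            out.close()
    return out_paths


# Hypothetical usage with the paths from the cell above:
# split_column('/opt/akokai/data/EPA/dsstox-20160701.tsv',
#              '/opt/akokai/data/EPA/dsstox-inchikey-')
```

Like the `awk` command, this treats every line as data, so a header row (if the file has one) ends up in the first output file.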

After running the ID conversion on both files, the results are saved as the two tab-separated files read below.

### Create table of InChIKey:CID mappings

In [3]:
CIDS_FILES = ['/opt/akokai/data/EPA/dsstox-cid-inchikey-1.txt',
              '/opt/akokai/data/EPA/dsstox-cid-inchikey-2.txt']

cids = pd.concat([pd.read_table(f, names=['inchikey', 'cid'], dtype=str)
                  for f in CIDS_FILES])
cids.dropna(inplace=True)
print('InChIKey-CID mappings:', len(cids))
cids.head()

InChIKey-CID mappings: 731405


|   | inchikey | cid |
|---|---|---|
| 0 | FJTNLJLPLJDTRM-UHFFFAOYSA-N | 62805 |
| 1 | IKHGUXGNUITLKF-UHFFFAOYSA-N | 177 |
| 2 | IMAGWKUTFZRWSB-HWKANZROSA-N | 9548611 |
| 3 | FZENGILVLUJGJX-NSCUHMNNSA-N | 5324279 |
| 4 | DLFVBJFMPXGRIB-UHFFFAOYSA-N | 178 |


### Keep only 1:1 mappings

Most InChIKeys map to exactly one CID. For now, at least, it is easier to work with just those.

In [4]:
# import matplotlib
# %matplotlib inline
multi = cids.groupby('inchikey')['cid'].count()
# multi.hist()

In [5]:
# Look up each InChIKey's CID count; Series.map avoids a slow row-by-row lambda.
cids['multi'] = cids['inchikey'].map(multi)
cids = cids[cids['multi'] == 1].drop('multi', axis=1)
print(len(cids), '1:1 InChIKey:CID mappings')
cids.head()

689963 1:1 InChIKey:CID mappings


|   | inchikey | cid |
|---|---|---|
| 0 | FJTNLJLPLJDTRM-UHFFFAOYSA-N | 62805 |
| 1 | IKHGUXGNUITLKF-UHFFFAOYSA-N | 177 |
| 2 | IMAGWKUTFZRWSB-HWKANZROSA-N | 9548611 |
| 3 | FZENGILVLUJGJX-NSCUHMNNSA-N | 5324279 |
| 4 | DLFVBJFMPXGRIB-UHFFFAOYSA-N | 178 |
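
The filter above keeps InChIKeys that have a single CID, but it does not check the reverse direction (one CID shared by several InChIKeys). If a strictly 1:1 table is wanted, both columns can be checked for repeats. The sketch below is one way to do that in pandas; the helper name `one_to_one` and the toy keys are made up for illustration and are not real InChIKeys.

```python
import pandas as pd


def one_to_one(df, left='inchikey', right='cid'):
    """Keep rows whose `left` value and `right` value each occur only once."""
    # duplicated(keep=False) marks every member of a repeated group.
    repeated_left = df[left].duplicated(keep=False)
    repeated_right = df[right].duplicated(keep=False)
    return df[~(repeated_left | repeated_right)]


# Toy data: BBB has two CIDs; CCC and DDD share CID 4.
toy = pd.DataFrame({
    'inchikey': ['AAA', 'BBB', 'BBB', 'CCC', 'DDD'],
    'cid':      ['1',   '2',   '3',   '4',   '4'],
})
strict = one_to_one(toy)
print(strict)
```

Only the `AAA` row survives here. Whether the reverse check removes anything from the real data has not been verified in this notebook.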


In [6]:
conn = create_engine('postgresql://akokai@localhost/chmdata')

In [7]:
# To be able to re-run this code from scratch, first drop the table if it already exists:
!psql chmdata -c 'drop table if exists cids;'

DROP TABLE


In [8]:
dtypes = {'inchikey': types.String, 'cid': types.Integer}
cids.to_sql('cids', conn, if_exists='fail', index=False, chunksize=65536, dtype=dtypes)

In [9]:
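# Check that the table contains the expected data.
cmd = text('select * from cids limit 5;')
# Use an explicit connection; Engine.execute() was removed in SQLAlchemy 2.0.
with conn.connect() as connection:
    rows = connection.execute(cmd).fetchall()
rows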

[('FJTNLJLPLJDTRM-UHFFFAOYSA-N', 62805),
 ('IKHGUXGNUITLKF-UHFFFAOYSA-N', 177),
 ('IMAGWKUTFZRWSB-HWKANZROSA-N', 9548611),
 ('FZENGILVLUJGJX-NSCUHMNNSA-N', 5324279),
 ('DLFVBJFMPXGRIB-UHFFFAOYSA-N', 178)]

## Join CIDs table into table of molecular structures

Inner join the tables `chem` (created in the previous notebook; contains molecular structures) and `cids` on their `inchikey` columns.

In [10]:
# To be able to re-run this code from scratch, first drop the table if it already exists:
!psql chmdata -c 'drop table if exists cpds;'

DROP TABLE


In [11]:
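cmd = text(
    '''create table cpds
       as select chem.dtxsid, cids.cid, chem.inchikey, chem.inchi, chem.molecule
       from chem
       inner join cids on chem.inchikey = cids.inchikey;''')
# engine.begin() commits on success; Engine.execute() was removed in SQLAlchemy 2.0.
with conn.begin() as connection:
    res = connection.execute(cmd)
    print(res.rowcount)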

689629
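
The join returns 689629 rows, fewer than the 689963 rows loaded into `cids`. An inner join drops any key that is missing from either table, so some InChIKeys in `cids` presumably have no match in `chem`. The toy example below illustrates that behavior and one way to list the unmatched keys. It uses an in-memory SQLite database from the Python standard library as a stand-in for PostgreSQL, and all table contents are invented.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript('''
    create table chem (dtxsid text, inchikey text);
    create table cids (inchikey text, cid integer);
    insert into chem values ('DTXSID-A', 'KEY-A'), ('DTXSID-B', 'KEY-B');
    insert into cids values ('KEY-A', 1), ('KEY-C', 3);
''')

# Inner join: only KEY-A appears in both tables.
joined = db.execute('''
    select chem.dtxsid, cids.cid, chem.inchikey
    from chem
    inner join cids on chem.inchikey = cids.inchikey
''').fetchall()

# Anti-join: rows of cids that found no partner in chem.
unmatched = db.execute('''
    select cids.inchikey
    from cids
    left join chem on chem.inchikey = cids.inchikey
    where chem.inchikey is null
''').fetchall()

print(joined)
print(unmatched)
```

The same `left join ... where ... is null` pattern should work unchanged against the PostgreSQL tables, although it has not been run here.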


## Recreate the molecular structure index on the new table

This takes a while.

In [12]:
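cmd = text('create index cpdidx on cpds using gist(molecule);')
# engine.begin() commits on success; Engine.execute() was removed in SQLAlchemy 2.0.
with conn.begin() as connection:
    connection.execute(cmd)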
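
Once the index exists, it can support substructure searches through the RDKit cartridge's `@>` operator. The following is a hedged sketch rather than part of the original workflow: it assumes the `molecule` column has the cartridge's `mol` type, that the psycopg2 driver is in use, and that a SMILES pattern such as benzene (`c1ccccc1`) is a sensible query. The names `SUBSTRUCTURE_SQL` and `find_substructure` are hypothetical.

```python
# Query text with named bind parameters (SQLAlchemy `text` style).
# cast(... as mol) is used instead of `::mol` because a double colon
# would clash with the `:name` bind-parameter syntax.
SUBSTRUCTURE_SQL = '''
    select dtxsid, cid, inchikey
    from cpds
    where molecule @> cast(:pattern as mol)
    limit :n
'''


def find_substructure(engine, pattern, n=5):
    """Return up to `n` rows of cpds whose structure contains `pattern`.

    `engine` is assumed to be a SQLAlchemy engine such as the one
    created earlier with create_engine().
    """
    # Imported here so the query text can be inspected without SQLAlchemy.
    from sqlalchemy import text

    with engine.connect() as connection:
        result = connection.execute(text(SUBSTRUCTURE_SQL),
                                    {'pattern': pattern, 'n': n})
        return result.fetchall()


# Hypothetical usage with the engine from this notebook:
# find_substructure(conn, 'c1ccccc1')
```

Running the query under `explain` in `psql` would show whether the planner actually uses `cpdidx`.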