## NCI Open Database

This notebook checks whether any of the fragment-like set compounds have experimental logP values reported in the NCI Open Database.

I downloaded the August 2006 release of the NCI Open Database (`ncidb_August2006.sdf`) to the `./ncidb` directory because this
release includes experimental logP values. The directory is not included in the repository due to its size. The file can be
downloaded from https://cactus.nci.nih.gov/download/nci

Before running this notebook, convert the database SDF file to CSV for easier manipulation:  
$ python convert.py ncidb_August2006.sdf ncidb_August2006.csv
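
`convert.py` itself is not shown in this notebook. As a rough illustration of what the conversion involves, here is a minimal standard-library sketch that pulls the data fields (the `>  <Tag>` blocks) out of each SDF record and writes them to CSV. The function names `iter_sdf_data_fields` and `sdf_fields_to_csv` are hypothetical, and the SDF tag names are assumed to match the CSV column names shown in the cells below. Unlike the real script, this sketch does not generate a SMILES column from the mol block, which would need a cheminformatics toolkit.

```python
import csv


def iter_sdf_data_fields(handle):
    """Yield one dict of data fields per SDF record (records end with '$$$$')."""
    record, tag, values = {}, None, []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$$"):
            # End of record: flush any field still open, then emit the record.
            if tag is not None:
                record[tag] = "\n".join(values)
            yield record
            record, tag, values = {}, None, []
        elif line.startswith(">") and "<" in line:
            # Data header such as '>  <Experimental logP>'.
            if tag is not None:
                record[tag] = "\n".join(values)
            tag = line[line.index("<") + 1 : line.rindex(">")]
            values = []
        elif tag is not None:
            if line.strip() == "":
                # A blank line terminates the current data field.
                record[tag] = "\n".join(values)
                tag = None
            else:
                values.append(line)
        # Lines outside a data field (the mol block) are ignored in this sketch.


def sdf_fields_to_csv(sdf_handle, csv_handle, columns):
    """Write the chosen data fields to CSV; missing fields become empty cells."""
    writer = csv.DictWriter(
        csv_handle, fieldnames=columns, extrasaction="ignore", restval=""
    )
    writer.writeheader()
    n_records = 0
    for record in iter_sdf_data_fields(sdf_handle):
        writer.writerow(record)
        n_records += 1
    return n_records
```

To use it, open the SDF file for reading and the CSV file for writing (with `newline=""`), then pass both handles along with the list of tags to keep, for example `["NSC Number", "Experimental logP"]` if those are the tag names in the file.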


In [12]:
import pandas as pd
import pickle
import numpy as np
from openeye import oechem, oedepict, oemolprop

In [13]:
# Import dataframe of selected fragment-like set compounds
df_frag = pd.read_csv("df_frag_final.csv", index_col=[0])

# Convert canonical isomeric SMILES to canonical SMILES for comparison to the NCI Open Database
df_frag["canonical SMILES"] = None

for i, row in df_frag.iterrows():
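    can_iso_smiles = row["canonical isomeric SMILES"]
    mol = oechem.OEGraphMol()
    # OESmilesToMol returns False when the SMILES string cannot be parsed
    if not oechem.OESmilesToMol(mol, can_iso_smiles):
        raise ValueError(f"Could not parse SMILES: {can_iso_smiles}")
    # Canonical SMILES without isomeric (stereo/isotope) information
    canonical_smiles = oechem.OECreateCanSmiString(mol)
    df_frag.loc[i, "canonical SMILES"] = canonical_smiles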

df_frag.head()

Unnamed: 0,eMolecules ID,canonical isomeric SMILES,eMolecules SMILES,"pKas in [3,11]",XlogP,MolWt,Availability (mg),Price,group,N_Rot,N_UV_chrom,Selection,Bin index,Priority,Final list,canonical SMILES
0,6679830,c1cc2c(cc1O)c3c(o2)C(=O)NCCC3,Oc1cc2c3CCCNC(=O)c3oc2cc1,[9.119],0.72,217.221,184.0,533.0,fragment-like,0,27,picked,4,1,True,c1cc2c(cc1O)c3c(o2)C(=O)NCCC3
1,719540,c1ccc(cc1)n2c3c(cn2)c(ncn3)N,Nc1ncnc2c1cnn2c1ccccc1,[3.869],1.499,211.223,3430.0,414.0,fragment-like,1,31,picked,8,1,True,c1ccc(cc1)n2c3c(cn2)c(ncn3)N
2,37095168,c1ccc2c(c1)ncn2c3ccc(cc3)O,Oc1ccc(cc1)n1cnc2c1cccc2,"[5.82, 8.709]",2.219,210.231,21650.2,148.0,fragment-like,1,40,picked,11,1,True,c1ccc2c(c1)ncn2c3ccc(cc3)O
3,37053191,c1ccc(cc1)c2[nH]c3ccc(cc3n2)C(=O)N,NC(=O)c1ccc2c(c1)nc([nH]2)c1ccccc1,[6.342],2.192,237.257,2000.0,168.0,fragment-like,2,42,picked,11,2,True,c1ccc(cc1)c2[nH]c3ccc(cc3n2)C(=O)N
4,31653344,c1ccc(cc1)n2cnc3c2ccc(c3)N,Nc1ccc2c(c1)ncn2c1ccccc1,[6.348],2.333,209.247,50213.0,148.0,fragment-like,1,40,picked,12,1,True,c1ccc(cc1)n2cnc3c2ccc(c3)N


In [14]:
# Read NCI Open Database CSV file
df_ncidb = pd.read_csv("./ncidb/ncidb_August2006.csv")
df_ncidb.head()

Unnamed: 0,SMILES,TITLE,NSC Number,Molecular Weight,ACD Name,Availabe Name Set,CAS Number,Formula,SMILES.1,KOW logP,Experimental logP
0,CC1=CC(=O)C=CC1=O,1,1,122.1232,"2-methylbenzo-1,4-quinone","2-methylbenzo-1,4-quinone (ACD/Name); p-Benzoq...",553-97-9,C7H6O2,CC1=CC(=O)C=CC1=O,0.79,0.72
1,c1ccc2c(c1)nc(s2)SSc3nc4ccccc4s3,2,2,332.4706,"2-(1,3-benzothiazol-2-yldithio)-1,3-benzothiaz...","2-(1,3-benzothiazol-2-yldithio)-1,3-benzothiaz...",120-78-5,C14H8N2S4,C4=CC1=C(N=C(S1)SSC3=NC2=CC=CC=C2S3)C=C4,,
2,c1c(cc(c(c1[N+](=O)[O-])O)Cl)[N+](=O)[O-],3,3,218.5531,"2-chloro-4,6-bis(hydroxy(oxido)amino)phenol","2-chloro-4,6-bis(hydroxy(oxido)amino)phenol (A...",946-31-6,C6H3ClN2O5,[O-][N+](=O)C1=CC(=CC(=C1O)Cl)[N+](=O)[O-],2.37,
3,[H]/N=c\1/[nH]cc(s1)[N+](=O)[O-],4,4,145.1356,"5-(hydroxy(oxido)amino)-2-imino-2,3-dihydro-1,...","5-(hydroxy(oxido)amino)-2-imino-2,3-dihydro-1,...",121-66-4,C3H3N3O2S,[O-][N+](=O)C1=CNC(=N)S1,-0.44,
4,c1ccc2c(c1)C(=O)c3ccc(cc3C2=O)N,5,5,223.2306,"2-aminoanthra-9,10-quinone","2-aminoanthra-9,10-quinone (ACD/Name); .beta.-...",117-79-3,C14H9NO2,O=C1C3=C(C(=O)C2=C1C=CC=C2)C=CC(=C3)N,2.43,


In [18]:
# Search fragment-like set compounds in NCI Open Database with canonical SMILES
frag_can_smiles_list = list(df_frag["canonical SMILES"])
df_ncidb.loc[df_ncidb["SMILES"].isin(frag_can_smiles_list)]

Unnamed: 0,SMILES,TITLE,NSC Number,Molecular Weight,ACD Name,Availabe Name Set,CAS Number,Formula,SMILES.1,KOW logP,Experimental logP
1389,c1ccc(cc1)n2c3c(cn2)c(ncn3)N,1401,1401,211.2256,"1-phenyl-1H-pyrazolo[3,4-d]pyrimidin-4-ylamine...","1-phenyl-1H-pyrazolo[3,4-d]pyrimidin-4-ylamine...",5334-30-5,C11H9N5,NC1=C3C(=NC=N1)[N](C2=CC=CC=C2)N=C3,1.04,


In [17]:
# Search fragment-like set compounds in NCI Open Database with canonical isomeric SMILES
frag_can_iso_smiles_list = list(df_frag["canonical isomeric SMILES"])
df_ncidb.loc[df_ncidb["SMILES"].isin(frag_can_iso_smiles_list)]

Unnamed: 0,SMILES,TITLE,NSC Number,Molecular Weight,ACD Name,Availabe Name Set,CAS Number,Formula,SMILES.1,KOW logP,Experimental logP
1389,c1ccc(cc1)n2c3c(cn2)c(ncn3)N,1401,1401,211.2256,"1-phenyl-1H-pyrazolo[3,4-d]pyrimidin-4-ylamine...","1-phenyl-1H-pyrazolo[3,4-d]pyrimidin-4-ylamine...",5334-30-5,C11H9N5,NC1=C3C(=NC=N1)[N](C2=CC=CC=C2)N=C3,1.04,


Only one of the selected fragment-like set compounds matched a record in the NCI Open Database, and that record has no experimental logP value. So I don't have to replace any of the selected compounds.
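
The same check can be done programmatically instead of by reading the output table. The sketch below merges the two dataframes on the SMILES columns and keeps only matches with a non-missing `Experimental logP`. The helper name `experimental_logp_hits` is hypothetical, and the toy dataframes reuse a few rows shown in the outputs above rather than the full files. Like the `isin` search, this relies on exact string equality, so it only finds matches when both SMILES columns were canonicalized the same way.

```python
import pandas as pd


def experimental_logp_hits(df_frag, df_ncidb,
                           frag_col="canonical SMILES",
                           nci_col="SMILES",
                           logp_col="Experimental logP"):
    """Return matched rows that also carry an experimental logP value."""
    matches = df_frag.merge(
        df_ncidb,
        left_on=frag_col,
        right_on=nci_col,
        how="inner",
        suffixes=("", " (NCI)"),  # disambiguate any shared column names
    )
    return matches[matches[logp_col].notna()]


# Toy data copied from rows displayed earlier in this notebook.
toy_frag = pd.DataFrame({
    "eMolecules ID": [6679830, 719540],
    "canonical SMILES": [
        "c1cc2c(cc1O)c3c(o2)C(=O)NCCC3",
        "c1ccc(cc1)n2c3c(cn2)c(ncn3)N",
    ],
})
toy_ncidb = pd.DataFrame({
    "SMILES": ["CC1=CC(=O)C=CC1=O", "c1ccc(cc1)n2c3c(cn2)c(ncn3)N"],
    "NSC Number": [1, 1401],
    "Experimental logP": [0.72, None],
})

hits = experimental_logp_hits(toy_frag, toy_ncidb)
print(len(hits))  # 0: the only matched record has no experimental logP
```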