## NCI Open Database

This notebook checks whether any of the drug-like set compounds have experimental logP values reported in the NCI Open Database.

I downloaded the August 2006 release of the NCI Open Database (`ncidb_August2006.sdf`) to the `./ncidb` directory because
this release includes experimental logP values. The directory is not included in the repository due to its size. The database
can be downloaded from https://cactus.nci.nih.gov/download/nci

Before running this notebook, convert the database SDF file to CSV for easier manipulation:  
$ python convert.py ncidb_August2006.sdf ncidb_August2006.csv
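
`convert.py` is not shown in this notebook, so the following is only a minimal sketch of what such a conversion could look like, using the Python standard library alone. It copies the record title and the SD data fields (the `> <TAG>` items) of each record into CSV columns. It does not generate the `SMILES` column seen in the CSV below, which requires a cheminformatics toolkit. The function names `iter_sdf_records` and `sdf_fields_to_csv` are hypothetical.

```python
import csv
import io
import re

# Matches SD data headers such as "> <NSC>" or ">  <Experimental logP>"
TAG_RE = re.compile(r"^>.*<([^>]+)>")


def iter_sdf_records(handle):
    """Yield one dict (title plus SD data fields) per record of an SDF stream."""
    record, tag, first_line = {}, None, True
    for raw in handle:
        line = raw.rstrip("\n")
        if line.strip() == "$$$$":  # record delimiter
            yield record
            record, tag, first_line = {}, None, True
            continue
        if first_line:  # first line of each record is the title
            record["TITLE"] = line.strip()
            first_line = False
            continue
        match = TAG_RE.match(line)
        if match:
            tag = match.group(1)
            record[tag] = ""
        elif tag is not None:
            if line.strip() == "":
                tag = None  # a blank line closes the current data item
            else:
                record[tag] = (record[tag] + " " + line.strip()).strip()


def sdf_fields_to_csv(sdf_handle, csv_handle):
    """Write all records to CSV; columns are the union of tags in first-seen order."""
    records = list(iter_sdf_records(sdf_handle))
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    writer = csv.DictWriter(csv_handle, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(records)
    return len(records)
```

Records that lack a given field, such as an experimental logP, get an empty cell, which pandas reads back as `NaN`.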


In [1]:
import pandas as pd
import pickle
import numpy as np
from openeye import oechem, oedepict, oemolprop

In [4]:
# Import dataframe of selected drug-like set compounds
df_drug = pd.read_csv("df_drug_final.csv", index_col=[0])

# Convert canonical isomeric SMILES to canonical (non-isomeric) SMILES for comparison to NCI Open Database
df_drug["canonical SMILES"] = None

for i, row in df_drug.iterrows():
    can_iso_smiles = row["canonical isomeric SMILES"]
    # Parse the isomeric SMILES into a molecule object
    mol = oechem.OEGraphMol()
    oechem.OESmilesToMol(mol, can_iso_smiles)
    # OECreateCanSmiString writes canonical SMILES without isomeric information
    canonical_smiles = oechem.OECreateCanSmiString(mol)
    df_drug.loc[i, "canonical SMILES"] = canonical_smiles

df_drug.head()

eMolecules ID,canonical isomeric SMILES,eMolecules SMILES,"pKas in [3,11]",XlogP,MolWt,Availability (mg),Price,group,N_Rot,N_UV_chrom,Selection,Bin index,Priority,Final list,canonical SMILES
536848,c1cc2c(cc(c(c2nc1)O)I)I,Ic1cc(I)c2c(c1O)nccc2,"[3.511, 6.794]",3.371,396.951,239.0,168.0,drug-like,0,29,picked,0.0,1,True,c1cc2c(cc(c(c2nc1)O)I)I
4375254,CCOC(=O)c1ccc(cc1)Nc2cc(nc(n2)Nc3ccc(cc3)C(=O)...,CCOC(=O)c1ccc(cc1)Nc1cc(C)nc(n1)Nc1ccc(cc1)C(=...,[6.336],2.937,420.461,319.0,168.0,drug-like,10,28,picked,0.0,3,True,CCOC(=O)c1ccc(cc1)Nc2cc(nc(n2)Nc3ccc(cc3)C(=O)...
18897105,c1ccc2c(c1)c(=O)[nH]c(n2)CCC(=O)Nc3ncc(s3)Cc4c...,O=C(Nc1ncc(s1)Cc1ccc(c(c1)F)F)CCc1nc2ccccc2c(=...,"[9.381, 10.773]",3.341,426.439,247.7,223.0,drug-like,7,37,picked,0.0,2,True,c1ccc2c(c1)c(=O)[nH]c(n2)CCC(=O)Nc3ncc(s3)Cc4c...
1574612,c1cc(cc(c1)Br)Nc2c(cnc(n2)Nc3cccc(c3)Br)F,Brc1cccc(c1)Nc1ncc(c(n1)Nc1cccc(c1)Br)F,[3.892],4.14,438.092,222.0,168.0,drug-like,4,28,picked,2.0,1,True,c1cc(cc(c1)Br)Nc2c(cnc(n2)Nc3cccc(c3)Br)F
3365457,CCOc1ccc2c(c1)sc(n2)NC(=O)Cc3ccc(c(c3)Cl)Cl,CCOc1ccc2c(c1)sc(n2)NC(=O)Cc1ccc(c(c1)Cl)Cl,[9.167],5.171,381.276,489.9,148.0,drug-like,6,28,picked,3.0,1,True,CCOc1ccc2c(c1)sc(n2)NC(=O)Cc3ccc(c(c3)Cl)Cl


In [5]:
# Read NCI Open Database CSV file
df_ncidb = pd.read_csv("./ncidb/ncidb_August2006.csv")
df_ncidb.head()

Unnamed: 0,SMILES,TITLE,NSC Number,Molecular Weight,ACD Name,Availabe Name Set,CAS Number,Formula,SMILES.1,KOW logP,Experimental logP
0,CC1=CC(=O)C=CC1=O,1,1,122.1232,"2-methylbenzo-1,4-quinone","2-methylbenzo-1,4-quinone (ACD/Name); p-Benzoq...",553-97-9,C7H6O2,CC1=CC(=O)C=CC1=O,0.79,0.72
1,c1ccc2c(c1)nc(s2)SSc3nc4ccccc4s3,2,2,332.4706,"2-(1,3-benzothiazol-2-yldithio)-1,3-benzothiaz...","2-(1,3-benzothiazol-2-yldithio)-1,3-benzothiaz...",120-78-5,C14H8N2S4,C4=CC1=C(N=C(S1)SSC3=NC2=CC=CC=C2S3)C=C4,,
2,c1c(cc(c(c1[N+](=O)[O-])O)Cl)[N+](=O)[O-],3,3,218.5531,"2-chloro-4,6-bis(hydroxy(oxido)amino)phenol","2-chloro-4,6-bis(hydroxy(oxido)amino)phenol (A...",946-31-6,C6H3ClN2O5,[O-][N+](=O)C1=CC(=CC(=C1O)Cl)[N+](=O)[O-],2.37,
3,[H]/N=c\1/[nH]cc(s1)[N+](=O)[O-],4,4,145.1356,"5-(hydroxy(oxido)amino)-2-imino-2,3-dihydro-1,...","5-(hydroxy(oxido)amino)-2-imino-2,3-dihydro-1,...",121-66-4,C3H3N3O2S,[O-][N+](=O)C1=CNC(=N)S1,-0.44,
4,c1ccc2c(c1)C(=O)c3ccc(cc3C2=O)N,5,5,223.2306,"2-aminoanthra-9,10-quinone","2-aminoanthra-9,10-quinone (ACD/Name); .beta.-...",117-79-3,C14H9NO2,O=C1C3=C(C(=O)C2=C1C=CC=C2)C=CC(=C3)N,2.43,


In [6]:
# Search for drug-like set compounds in NCI Open Database by canonical SMILES
drug_can_smiles_list = list(df_drug["canonical SMILES"])
df_ncidb.loc[df_ncidb["SMILES"].isin(drug_can_smiles_list)]

Unnamed: 0,SMILES,TITLE,NSC Number,Molecular Weight,ACD Name,Availabe Name Set,CAS Number,Formula,SMILES.1,KOW logP,Experimental logP
8541,c1cc2c(cc(c(c2nc1)O)I)I,8704,8704,396.9536,"5,7-diiodo-8-quinolinol","5,7-diiodo-8-quinolinol (ACD/Name); component ...",83-73-8,C9H5I2NO,OC1=C2C(=C(I)C=C1I)C=CC=N2,4.0,
60186,c1cc2c(cc(c(c2nc1)O)I)I,74939,74939,396.9536,"5,7-diiodo-8-quinolinol","5,7-diiodo-8-quinolinol (ACD/Name)",,C9H5I2NO,OC1=C2C(=C(I)C=C1I)C=CC=N2,4.0,


In [7]:
# Search for drug-like set compounds in NCI Open Database by canonical isomeric SMILES
drug_can_iso_smiles_list = list(df_drug["canonical isomeric SMILES"])
df_ncidb.loc[df_ncidb["SMILES"].isin(drug_can_iso_smiles_list)]

Unnamed: 0,SMILES,TITLE,NSC Number,Molecular Weight,ACD Name,Availabe Name Set,CAS Number,Formula,SMILES.1,KOW logP,Experimental logP
8541,c1cc2c(cc(c(c2nc1)O)I)I,8704,8704,396.9536,"5,7-diiodo-8-quinolinol","5,7-diiodo-8-quinolinol (ACD/Name); component ...",83-73-8,C9H5I2NO,OC1=C2C(=C(I)C=C1I)C=CC=N2,4.0,
60186,c1cc2c(cc(c(c2nc1)O)I)I,74939,74939,396.9536,"5,7-diiodo-8-quinolinol","5,7-diiodo-8-quinolinol (ACD/Name)",,C9H5I2NO,OC1=C2C(=C(I)C=C1I)C=CC=N2,4.0,


Only one of the selected drug-like set compounds (5,7-diiodo-8-quinolinol, eMolecules ID 536848) matched records in NCIDB. It matched two records (NSC 8704 and NSC 74939), and neither has an experimental logP. So I don't have to replace any of the selected compounds.
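
The same check can be done programmatically rather than by reading the output tables. The sketch below joins the selected compounds to NCIDB records on SMILES, then keeps only matches with a non-missing `Experimental logP`. The toy dataframes `df_drug_toy` and `df_ncidb_toy` are stand-ins built from a few rows shown in the outputs above. They are not the real data, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for df_drug (two compounds copied from the df_drug.head() output)
df_drug_toy = pd.DataFrame(
    {"canonical SMILES": ["c1cc2c(cc(c(c2nc1)O)I)I",
                          "c1cc(cc(c1)Br)Nc2c(cnc(n2)Nc3cccc(c3)Br)F"]},
    index=pd.Index([536848, 1574612], name="eMolecules ID"),
)

# Toy stand-in for df_ncidb (three records copied from the outputs above)
df_ncidb_toy = pd.DataFrame({
    "SMILES": ["CC1=CC(=O)C=CC1=O",
               "c1cc2c(cc(c(c2nc1)O)I)I",
               "c1cc2c(cc(c(c2nc1)O)I)I"],
    "NSC Number": [1, 8704, 74939],
    "KOW logP": [0.79, 4.0, 4.0],
    "Experimental logP": [0.72, None, None],  # None becomes NaN
})

# Inner join keeps only compounds whose SMILES appears in the database,
# and carries the eMolecules ID along with each matching NCI record
matches = df_drug_toy.reset_index().merge(
    df_ncidb_toy, left_on="canonical SMILES", right_on="SMILES", how="inner"
)

# Keep only matches that actually carry an experimental logP
with_exp_logp = matches[matches["Experimental logP"].notna()]

print(matches[["eMolecules ID", "NSC Number", "Experimental logP"]])
print("Matches with experimental logP:", len(with_exp_logp))
```

Unlike the `isin` filter, the merge records which selected compound each NCI record corresponds to, which helps when several compounds match.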