We have the following problem: drug names are not unique across the datasets. The same compound can appear under different synonyms, with typos, or with inconsistent capitalization. Our first resort was to retrieve the PubChem IDs, but this opened up new problems: as it turns out, the PubChem IDs are not unique either. This notebook tries to assess how large the problem is.
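
As a first, cheap check, capitalization variants can be found by grouping names on a normalized key. The sketch below uses a small made-up sample (the `Nutlin-3`/`nutlin-3` and `Temsirolimus`/`temsirolimus` pairs do show up later in this notebook); the helper name `group_by_normalized` is hypothetical. It only catches differences in case and surrounding whitespace, not synonyms or typos.

```python
from collections import defaultdict

# Small illustrative sample, not the real data.
names = ["Nutlin-3", "nutlin-3", "Temsirolimus", "temsirolimus", "RAF265"]


def group_by_normalized(names):
    """Group raw names by a case- and whitespace-insensitive key.

    Only keys with more than one spelling are returned.
    """
    groups = defaultdict(set)
    for name in names:
        groups[name.strip().casefold()].add(name)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


variant_groups = group_by_normalized(names)
print(variant_groups)
```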

In [1]:
import pandas as pd
import pubchempy as pcp

In [3]:
all_datasets = {
    "CCLE": "/Users/judithbernett/PycharmProjects/drp_model_suite/data/CCLE/CCLE.csv",
    "CTRPv1": "/Users/judithbernett/PycharmProjects/drp_model_suite/data/CTRPv1/CTRPv1.csv",
    "CTRPv2": "/Users/judithbernett/PycharmProjects/drp_model_suite/data/CTRPv2/CTRPv2.csv",
    "GDSC1": "/Users/judithbernett/PycharmProjects/drp_model_suite/data/GDSC1/GDSC1.csv",
    "GDSC2": "/Users/judithbernett/PycharmProjects/drp_model_suite/data/GDSC2/GDSC2.csv",
}
all_drug_names = pd.DataFrame(columns=["pubchem_id","drug_name", "dataset"])
for dataset, file in all_datasets.items():
    df = pd.read_csv(file)
    df = df[["pubchem_id", "drug_name"]].drop_duplicates()
    df["dataset"] = dataset
    all_drug_names = pd.concat([all_drug_names, df])
all_drug_names



Unnamed: 0,pubchem_id,drug_name,dataset
0,11656518,RAF265,CCLE
1,24180719,PLX4720,CCLE
2,10127622,AZD6244,CCLE
3,10302451,AZD0530,CCLE
4,10461815,PHA-665752,CCLE
...,...,...,...
229715,3899.0,Leflunomide,GDSC2
231502,864.0,alpha-lipoic acid,GDSC2
232237,124886.0,glutathione,GDSC2
232969,12035.0,N-acetyl cysteine,GDSC2
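
To get a first feeling for the ambiguity in both directions, one can count distinct names per PubChem ID and distinct IDs per name. This is a minimal sketch on a made-up frame with the same columns as `all_drug_names`. Note that the table above shows IDs both as `11656518` and as `3899.0`, so on the real data it is probably worth bringing `pubchem_id` to one consistent type before grouping.

```python
import pandas as pd

# Toy stand-in for all_drug_names; all rows are invented for illustration.
toy = pd.DataFrame({
    "pubchem_id": [1, 1, 2, 3, 4],
    "drug_name": ["Drug A", "drug a", "Drug B", "Drug C", "Drug C"],
    "dataset": ["CCLE", "GDSC1", "CCLE", "CTRPv2", "GDSC2"],
})

# Look at unique (ID, name) pairs only, ignoring which dataset they came from.
pairs = toy[["pubchem_id", "drug_name"]].drop_duplicates()
names_per_id = pairs.groupby("pubchem_id")["drug_name"].nunique()
ids_per_name = pairs.groupby("drug_name")["pubchem_id"].nunique()

ids_with_several_names = names_per_id[names_per_id > 1]
names_with_several_ids = ids_per_name[ids_per_name > 1]
print(ids_with_several_names)
print(names_with_several_ids)
```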


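We will now iterate through all drug names. For each drug name, we query PubChem by name and record the PubChem ID, the IUPAC name, the canonical SMILES, the CACTVS fingerprint, and the raw fingerprint. Names that return no compound or more than one compound are collected in separate lists.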

In [30]:
drug_names = set(all_drug_names["drug_name"])
iupac_hashmap = {}
multiple_drug_names = []
no_drug_names = []
result_df = pd.DataFrame(columns=["drug_name", "pubchem_id", "iupac_name", "canonical_smiles", "cactvs_fingerprint", "fingerprint"])
for idx, drug_name in enumerate(drug_names):
    if idx % 10 == 0:
        print(f"Processing {idx} of {len(drug_names)}")
    try:
        compound = pcp.get_compounds(drug_name, namespace="name")
        if len(compound) == 0:
            raise ValueError(f"No compound found for {drug_name}")
        elif len(compound) > 1:
            raise ValueError(f"Multiple compounds found for {drug_name}")
        compound = compound[0]
        iupac_hash = hash(compound.iupac_name)
        if iupac_hash not in iupac_hashmap:
            iupac_hashmap[iupac_hash] = [drug_name]
        elif drug_name not in iupac_hashmap[iupac_hash]:
            print(f"Collision for {drug_name}: {iupac_hashmap[iupac_hash]}")
            iupac_hashmap[iupac_hash].append(drug_name)
        result_df = pd.concat([result_df, pd.DataFrame.from_dict({
            "drug_name": [drug_name],
            "pubchem_id": [compound.cid],
            "iupac_name": [compound.iupac_name],
            "canonical_smiles": [compound.canonical_smiles],
            "cactvs_fingerprint": [compound.cactvs_fingerprint],
            "fingerprint": [compound.fingerprint]
        })
        ])
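    except Exception as e:
        # Sort the failure into a bucket based on the message raised above.
        # Any other error (e.g., a failed request) is only printed, not recorded.
        print(f"Error for {drug_name}: {e}")
        if "No compound found" in str(e):
            no_drug_names.append(drug_name)
        elif "Multiple compounds found" in str(e):
            multiple_drug_names.append(drug_name)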

Processing 0 of 1253
Error for navitoclax:sorafenib (1:1 mol/mol): No compound found for navitoclax:sorafenib (1:1 mol/mol)
Error for BRD-K75293299: Multiple compounds found for BRD-K75293299
Error for erlotinib:PLX-4032 (2:1 mol/mol): No compound found for erlotinib:PLX-4032 (2:1 mol/mol)
Processing 10 of 1253
Error for BRD-A02303741: Multiple compounds found for BRD-A02303741
Error for PARP_9495: No compound found for PARP_9495
Error for navitoclax:gemcitabine (1:1 mol/mol): No compound found for navitoclax:gemcitabine (1:1 mol/mol)
Error for ascorbate (vitamin C): No compound found for ascorbate (vitamin C)
Processing 20 of 1253
Error for BRD-K55116708: Multiple compounds found for BRD-K55116708
Error for navitoclax:MST-312 (1:1 mol/mol): No compound found for navitoclax:MST-312 (1:1 mol/mol)
Error for SR-II-138A: No compound found for SR-II-138A
Error for PBD-288: No compound found for PBD-288
Processing 30 of 1253
Error for IAP_7638: No compound found for IAP_7638
Error for 1S,3R-
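
The loop above decides which bucket a name belongs to by matching on the exception message, and errors of any other kind are printed but not stored. A possible alternative is sketched below: the lookup is passed in as a function, so the classification can be tried offline. `classify_names`, `fake_db`, and `fake_lookup` are hypothetical names; on the real data the lookup would be something like `lambda n: pcp.get_compounds(n, namespace="name")`.

```python
def classify_names(names, lookup):
    """Split names into unique hits, ambiguous names, missing names, and failed lookups.

    `lookup` is any callable that returns a list of matches for a name.
    """
    unique, multiple, missing, failed = {}, [], [], []
    for name in names:
        try:
            hits = lookup(name)
        except Exception as err:  # e.g., network or server errors
            failed.append((name, str(err)))
            continue
        if len(hits) == 0:
            missing.append(name)
        elif len(hits) > 1:
            multiple.append(name)
        else:
            unique[name] = hits[0]
    return unique, multiple, missing, failed


# Offline stand-in for PubChem so that the sketch runs without network access.
fake_db = {"drug a": ["cpd1"], "drug b": ["cpd2", "cpd3"]}


def fake_lookup(name):
    if name == "drug d":
        raise TimeoutError("simulated timeout")
    return fake_db.get(name, [])


unique, multiple, missing, failed = classify_names(
    ["drug a", "drug b", "drug c", "drug d"], fake_lookup
)
print(unique, multiple, missing, failed)
```

Keeping the failed lookups in their own list makes it possible to retry them later instead of losing them.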

In [31]:
print("Overall stats: ")
print(f"Number of drug names: {len(drug_names)}")
print(f"Number of drug names with multiple compounds: {len(multiple_drug_names)}")
print(f"Number of drug names with no compounds: {len(no_drug_names)}")
iupac_hash_collisions = sum([len(v) for v in iupac_hashmap.values() if len(v) > 1])
print(f"Number of IUPAC hash collisions: {iupac_hash_collisions}")

Overall stats: 
Number of drug names: 1253
Number of drug names with multiple compounds: 110
Number of drug names with no compounds: 258
Number of IUPAC hash collisions: 226
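
Two details of the collision count are easy to miss. First, the number printed above adds up the names in all collision groups; it is not the number of groups. Second, the groups are keyed on `hash(compound.iupac_name)`, so if a compound has no IUPAC name (we would expect `None` in that case, but this is an assumption), all such names end up in one group. The sketch below groups on the IUPAC name itself, skips missing names, and reports both counts. The records are made up.

```python
from collections import defaultdict

# Made-up (drug_name, iupac_name) pairs; None stands for a missing IUPAC name.
records = [
    ("Drug A", "iupac-x"),
    ("drug a", "iupac-x"),
    ("Drug B", "iupac-y"),
    ("Drug C", None),
    ("Drug D", None),
]

names_by_iupac = defaultdict(list)
for drug_name, iupac_name in records:
    if iupac_name is None:
        # Without this check, "Drug C" and "Drug D" would look like a collision.
        continue
    if drug_name not in names_by_iupac[iupac_name]:
        names_by_iupac[iupac_name].append(drug_name)

shared = {iupac: names for iupac, names in names_by_iupac.items() if len(names) > 1}
n_groups = len(shared)
n_names = sum(len(names) for names in shared.values())
print(f"{n_groups} groups covering {n_names} names")
```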


In [32]:
iupac_hash_collisions = [v for v in iupac_hashmap.values() if len(v) > 1]

In [34]:
pd.DataFrame(multiple_drug_names, columns=["drug_name"]).to_csv("multiple_drug_names.csv", index=False)
pd.DataFrame(no_drug_names, columns=["drug_name"]).to_csv("no_drug_names.csv", index=False)
pd.DataFrame(iupac_hash_collisions, columns=["drug_name_1", "drug_name_2", "drug_name_3"]).to_csv("iupac_hash_collisions.csv", index=False)

In [35]:
result_df.to_csv("all_results.csv", index=False)
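
The fixed column list `drug_name_1` to `drug_name_3` used when saving the collisions only fits groups of at most three names. A sketch that derives the column names from the widest group is shown below; the groups are invented, and `wide` is a hypothetical variable name.

```python
import pandas as pd

# Invented groups of different lengths.
groups = [["Drug A", "drug a"], ["Drug B", "drug b", "DRUG B", "Drug-B"]]

# Rows shorter than the widest group are padded with missing values.
wide = pd.DataFrame(groups)
wide.columns = [f"drug_name_{i + 1}" for i in range(wide.shape[1])]
print(wide)
```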

First task: harmonize the IUPAC collisions, i.e., the drug names that resolve to the same IUPAC name. Find out whether the names within a collision group nevertheless got different PubChem IDs.

In [45]:
import numpy as np
iupac_flat = pd.DataFrame(iupac_hash_collisions, columns=["drug_name_1", "drug_name_2", "drug_name_3"]).to_numpy().flatten()
result_subset = result_df[result_df["drug_name"].isin(iupac_flat)]
for collision_list in iupac_hash_collisions:
    pubchem_ids = result_subset[result_subset["drug_name"].isin(collision_list)]["pubchem_id"].unique()
    if len(pubchem_ids) > 1:
        print(f"Collision for {collision_list}: {pubchem_ids}")

Great news: within every collision group, all names got the same PubChem ID. We will harmonize the names in a separate CSV.

In [48]:
hash_collisions_resolved = pd.read_csv("iupac_hash_collisions_resolved.csv", index_col=0)
mapping_dict = {}
for idx, row in hash_collisions_resolved.iterrows():
    for drug_name in row:
        if not pd.isna(drug_name):
            mapping_dict[drug_name] = idx
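
The loop above implies a layout for `iupac_hash_collisions_resolved.csv`: the first column holds the harmonized name and becomes the index, and the remaining columns hold the variants. This layout is inferred from the code, not from the file itself, and the header `harmonized_name` is a hypothetical choice. A small in-memory example:

```python
import io

import pandas as pd

# Invented file content following the assumed layout.
csv_text = """harmonized_name,drug_name_1,drug_name_2,drug_name_3
Drug A,Drug A,drug a,
Drug B,Drug B,DRUG B,drug-b
"""
resolved = pd.read_csv(io.StringIO(csv_text), index_col=0)

toy_mapping = {}
for harmonized_name, row in resolved.iterrows():
    # dropna() skips the padding of rows that have fewer variants.
    for variant in row.dropna():
        toy_mapping[variant] = harmonized_name
print(toy_mapping)
```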

Next, we will resolve the multi-ID cases, i.e., the drug names for which PubChem returned more than one compound. We will create a CSV with the resolved names by always taking the first PubChem ID.

In [50]:
multi_mapping_resolved = pd.DataFrame(columns=["pubchem_id", "drug_name"])
for drug in multiple_drug_names:
    cpd = pcp.get_compounds(drug, namespace="name")[0]
    multi_mapping_resolved = pd.concat([multi_mapping_resolved, pd.DataFrame.from_dict({
        "pubchem_id": [cpd.cid],
        "drug_name": [drug]
    })])
    result_df = pd.concat([result_df, pd.DataFrame.from_dict({
        "drug_name": [drug],
        "pubchem_id": [cpd.cid],
        "iupac_name": [cpd.iupac_name],
        "canonical_smiles": [cpd.canonical_smiles],
        "cactvs_fingerprint": [cpd.cactvs_fingerprint],
        "fingerprint": [cpd.fingerprint]
    })])

In [52]:
multi_mapping_resolved[multi_mapping_resolved.duplicated(subset="pubchem_id")]

Unnamed: 0,pubchem_id,drug_name
0,6918289,temsirolimus
0,135398738,Dacarbazine
0,216345,nutlin-3
0,7251185,parthenolide
0,73707371,PRL-3 inhibitor I
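
By default, `duplicated` flags only the second and later occurrence of an ID, which is why the output above lists a single name per shared ID. Passing `keep=False` shows every name that shares an ID, which makes it easier to see the pairs that need harmonizing. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    "pubchem_id": [10, 10, 20, 30, 30],
    "drug_name": ["Drug A", "drug a", "Drug B", "Drug C", "drug c"],
})

# Default behavior: only the repeats after the first occurrence are flagged.
later_only = toy[toy.duplicated(subset="pubchem_id")]

# keep=False flags every row whose ID occurs more than once.
all_members = toy[toy.duplicated(subset="pubchem_id", keep=False)]
names_per_id = all_members.groupby("pubchem_id")["drug_name"].apply(list).to_dict()
print(names_per_id)
```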


In [53]:
mapping_dict.update({
    "Nutlin-3": "Nutlin-3",
    "nutlin-3": "Nutlin-3",
    "Temsirolimus": "Temsirolimus",
    "temsirolimus": "Temsirolimus",
    "Parthenolide": "Parthenolide",
    "parthenolide": "Parthenolide",
    "PRL-3 Inhibitor I": "PRL-3 Inhibitor I",
    "PRL-3 inhibitor I": "PRL-3 Inhibitor I",
    "Dacarbazine": "Dacarbazine",
    "dacarbazine": "Dacarbazine"
})
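
How a mapping like this could later be applied to a dataframe is sketched below. Names without an entry keep their original spelling. The column name `harmonized_name` and the toy data are assumptions for illustration.

```python
import pandas as pd

toy_mapping = {"nutlin-3": "Nutlin-3", "Nutlin-3": "Nutlin-3"}
toy = pd.DataFrame({"drug_name": ["nutlin-3", "Nutlin-3", "RAF265"]})

# map() yields NaN for names that are not in the dictionary;
# fillna() falls back to the original name in that case.
toy["harmonized_name"] = toy["drug_name"].map(toy_mapping).fillna(toy["drug_name"])
print(toy)
```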

In [54]:
multi_mapping_resolved.to_csv("multi_mapping_resolved.csv", index=False)

Next issue: unmappable drug names. Let's see how many of them we found so far.

In [63]:
subsetted_drug_names = all_drug_names[all_drug_names["drug_name"].isin(no_drug_names)]
subsetted_drug_names = subsetted_drug_names.sort_values(by="drug_name")

In [64]:
subsetted_drug_names.to_csv("no_drug_names_resolved.csv", index=False)
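
To put numbers on the unmappable names, one could count them overall and per dataset. The following sketch uses invented data; on the real data, `toy_all` would be `all_drug_names` and `toy_unmapped` would be `no_drug_names`.

```python
import pandas as pd

toy_all = pd.DataFrame({
    "drug_name": ["Drug A", "Drug B", "Drug B", "Drug C"],
    "dataset": ["CCLE", "CTRPv2", "GDSC1", "GDSC2"],
})
toy_unmapped = ["Drug B", "Drug C"]

subset = toy_all[toy_all["drug_name"].isin(toy_unmapped)]
n_unique = subset["drug_name"].nunique()
per_dataset = subset.groupby("dataset")["drug_name"].nunique().to_dict()
print(n_unique, per_dataset)
```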

We manually change some of these mappings and adapt the mapping dictionary accordingly.

In [None]:
new_mappings = {
    "(-)-gallocatechin-3-monogallate": "(-)-Gallocatechin gallate",
    "16-beta-bromoandrosterone": "16beta-Bromoandrosterone",
    "1S,3R-RSL-3": "(1S,3R)-Rsl3",
    "2,4-dideoxy-DC-45-A2": "BRD-K41087962-001-01-7",
    "2-bromopyruvate": "3-Bromopyruvic acid",
    "2-deoxyglucose": "2-Deoxy-D-arabino-hexopyranose",
    "4-methylfasudil": "5-(1,4-Diazepan-1-ylsulfonyl)-4-methylisoquinoline",
    "5-benzyl-9-tert-butyl-paullone": "CHEMBL575106",
    "968": "Glutaminase C-IN-1",
    "AA-COCF3": "Aacocf3",
    "BRD-A86708339": "TCMDC-125552",
    "BRD-K03536150": "SCHEMBL16273428",
    "BRD-K07442505": "BAM7",
    "BRD-K20514654": "BRD-K20514654-001-01-8",
    "BRD-K29313308": "BRD3308",
    "BRD-K35604418": "mim1",
    "BRD-K63431240": "BRD1240",
    "BRD-K64610608": "MLS003179190",
    "BRD-K70511574": "HMS3654N14",
    "BRD-K79669418": "CHEMBL5275075",
    "BRD-K88742110": "Pci-34051",
    "BRD0713": "(2R,3R,4S)-4-(hydroxymethyl)-3-phenyl-1-propylazetidine-2-carbonitrile",
    "BRD1812": "ICG-001",
    "BRD2572": "1-[(1R,5S)-7-phenyl-6-propyl-3,6-diazabicyclo[3.1.1]heptan-3-yl]ethanone",
    "BRD4046": "[(2S,3R,4R)-4-(aminomethyl)-3-phenyl-1-propylazetidin-2-yl]methanol",
    "BRD4372": "1-[(1S,2aS,8bS)-1-(hydroxymethyl)-2-propyl-1,2a,3,8b-tetrahydroazeto[2,3-c]quinolin-4-yl]ethanone",
    "BRD4470": "BRD-4470",
    "BRD55319": "BRD-K53855319-001-01-2",
    "BRD5586": "5-phenyl-1,7-dihydrotetrazolo[1,5-a]pyrimidine",
    "BRD63610": "BRD-K85563610-001-01-0",
    "BRD6368": "BRD-K14696368-001-01-8",
    "BRD6430": "BRD-K19796430-001-01-5",
    "BRD6825": "BRD-96825",
    "BRD7137": "5H-quinolino[8,7-c][1,2]benzothiazine 6,6-dioxide",
    "BRD8097": "(1S,5R)-3-methylsulfonyl-7-phenyl-6-propyl-3,6-diazabicyclo[3.1.1]heptane",
    "BRD8418": "(1R,5S)-7-phenyl-6-propyl-3,6-diazabicyclo[3.1.1]heptane",
    "BRD8958": "C646",
    "Brivanib, BMS-540215": "Brivanib:BMS-540215",
    "CAP-232, TT-232, TLN-232": "CAP-232:TT-232:TLN-232",
    "CID 2853753": "2-(1,2-Dihydroimidazo[1,2-a]benzimidazol-4-yl)-1-(4-phenylphenyl)ethanone",
    "Cetuximab": "C225",
    "Compound 110": "BRD-K03618428-001-01-3",
    "Compound 11e": "VX-11e",
    "Compound 12": "BRD-K40892394-001-01-9",
    "Compound 2": "BCATc Inhibitor 2",
    "Compound 4": "PTP1B-IN-3",
    "Compound 44": "CHEMBL5270701",
    "DC-45-A2": "BRD-K79983625-001-01-1",
    "ISOX": "Bml-281",
    "KRAS (G12C) Inhibitor-12": "K-Ras(G12C) inhibitor 12",
    "ML334 diastereomer": "BRD-K93367411-001-03-3",
    "MPS-1-IN-1": "Mps1-IN-1",
    "MetAP2 Inhibitor, A832234": "MetAP2 Inhibitor:A832234",
    "Nutlin-3a (-)": "Rebemadlin",
    "P-0850": "BRAF inhibitor",
    "PP-30": "BRD-K30677119-001-01-0",
    "Picolinici-acid": "Picolinic acid",
    "QW-BI-011": "BRD-4770",
    "SR-II-138A": "Rohinitib",
    "T-5345967": "ML-031",
    "TTNBP": "Arotinoid acid",
    "VAF-347": "Vaf347",
    "Venotoclax": "Venetoclax",
    "YL54": "BRD-K58306044-001-01-3",
    "ascorbate (vitamin C)": "Ascorbic acid",
    "ceranib-2": "CHEMBL4788167",
    "compound 1B": "Dnmdp",
    "eEF2K Inhibitor, A-484954": "a-484954",
    "erastin-A8": "SCHEMBL4462685",
    "m-3M3-FBS": "Phospholipase",
    "racemic-2,4-dideoxy-DC-45-A2": "SCHEMBL18710180",
    "tipifarnib-P1": "Tipifarnib S enantiomer",
    "tipifarnib-P2": "Tipifarnib",
}
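
When manual mappings like these are merged into an existing dictionary, chains can appear: a name is mapped to a target that is itself remapped to something else. The sketch below merges two toy dictionaries and lists such chains so they can be resolved by hand. `merge_mappings` and `find_chains` are hypothetical helpers, and the entries are invented.

```python
def merge_mappings(base, extra):
    """Return a new dict in which entries from `extra` override those in `base`."""
    merged = dict(base)
    merged.update(extra)
    return merged


def find_chains(mapping):
    """Return entries whose target is itself remapped to a different name."""
    return {
        name: target
        for name, target in mapping.items()
        if target in mapping and mapping[target] != target
    }


base = {"drug a": "Drug A", "Drug A": "Drug A"}
extra = {"Drug A": "Compound A", "drug-x": "Drug X"}

merged = merge_mappings(base, extra)
chains = find_chains(merged)
print(chains)
```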

In [96]:
pattern1 = "CCCCCCCCCCCCCCCC(=O)N[C@@H](CCC(=O)NCCCC[C@@H](C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CC1=CC=CC=C1)C(=O)N[C@@H]([C@@H](C)CC)C(=O)N[C@@H](C)C(=O)N[C@@H](CC2=CNC3=CC=CC=C32)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](C(C)C)C(=O)N[C@@H](CCCNC(=N)N)C(=O)NCC(=O)N[C@@H](CCCNC(=N)N)C(=O)NCC(=O)O)NC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@H](CCC(=O)N)NC(=O)CNC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC4=CC=C(C=C4)O)NC(=O)[C@H](CO)NC(=O)[C@H](CO)NC(=O)[C@H](C(C)C)NC(=O)[C@H](CC(=O)O)NC(=O)[C@H](CO)NC(=O)[C@H]([C@@H](C)O)NC(=O)[C@H](CC5=CC=CC=C5)NC(=O)[C@H]([C@@H](C)O)NC(=O)CNC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](C)NC(=O)[C@H](CC6=CN=CN6)N)C(=O)O"
pattern2 = "CC(C)C[C@H](NC(=O)[C@H](C)NC(=O)[C@H](CCCCN)NC(=O)[C@H](CCCNC(N)=N)NC(C)=O)C(=O)N[C@@H](CCC(O)=O)C(=O)N[C@@H]([C@@H](C)O)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCNC(N)=N)C(=O)N[C@@H](CCCNC(N)=N)C(=O)N[C@@H](C(C)C)C(=O)NCC(=O)N[C@@H](CC(O)=O)C(=O)NCC(=O)N[C@@H](C(C)C)C(=O)N[C@@]1(C)CCC\C=C/CCC[C@](C)(NC(=O)[C@H](Cc2cnc[nH]2)NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](CCCNC(N)=N)NC1=O)C(=O)N[C@@H]([C@@H](C)O)C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1ccccc1)C(N)=O"
from rdkit import Chem
canon_smiles1 = Chem.CanonSmiles(pattern1)
canon_smiles2 = Chem.CanonSmiles(pattern2)
print(canon_smiles1)
print("vs.")
print(canon_smiles2)
print(canon_smiles1 == canon_smiles2)

CCCCCCCCCCCCCCCC(=O)N[C@@H](CCC(=O)NCCCC[C@H](NC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@H](CCC(N)=O)NC(=O)CNC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](Cc1ccc(O)cc1)NC(=O)[C@H](CO)NC(=O)[C@H](CO)NC(=O)[C@@H](NC(=O)[C@H](CC(=O)O)NC(=O)[C@H](CO)NC(=O)[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](NC(=O)CNC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](C)NC(=O)[C@@H](N)Cc1cnc[nH]1)[C@@H](C)O)[C@@H](C)O)C(C)C)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@@H](CCCNC(=N)N)C(=O)NCC(=O)N[C@@H](CCCNC(=N)N)C(=O)NCC(=O)O)C(C)C)[C@@H](C)CC)C(=O)O
vs.
CC(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](CCCCN)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@H](C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@H](C(=O)NCC(=O)N[C@@H](CC(=O)O)C(=O)NCC(=O)N[C@H](C(=O)N[C@@]1(C)CCC/C=C\CCC[C@@](C)(C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)N[C@@H](Cc2ccccc2)C(N)=O)[C@@H](C)O)NC(=O)[C@H](Cc2cnc[nH]2)NC(=O)[C@H