## Map compounds to targets and pathways

Many of the compounds are annotated with specific MOAs, and some are also annotated with targets.
Previously, we analyzed only the performance of MOA prediction. But what about target prediction and, further, pathway prediction?

Here, we:

1. Map compounds to the Drug Repurposing Hub target annotations, and
2. Use publicly available resources to map targets to pathways

In [1]:
import pathlib
import pandas as pd

In [2]:
# Load target file
commit = "58c86d50ec58af5adae330ac7e4329841c1e30e7"
target_map_file = f"https://github.com/broadinstitute/lincs-cell-painting/blob/{commit}/metadata/moa/repurposing_info_long.tsv?raw=true"

target_df = pd.read_csv(target_map_file, sep="\t", low_memory=False)

print(target_df.shape)
target_df.head(2)

(39471, 21)


Unnamed: 0,broad_id,pert_iname,clinical_phase,moa,target,disease_area,indication,qc_incompatible,purity,vendor,...,vendor_name,expected_mass,smiles,InChIKey,pubchem_cid,deprecated_broad_id,InChIKey14,repurposing_info_index,moa_unique,target_unique
0,BRD-K76022557-003-28-9,(R)-(-)-apomorphine,Launched,dopamine receptor agonist,ADRA2A|ADRA2B|ADRA2C|CALY|DRD1|DRD2|DRD3|DRD4|...,neurology/psychiatry,Parkinson's Disease,0,98.9,MedChemEx,...,Apomorphine (hydrochloride hemihydrate),267.126,CN1CCc2cccc-3c2[C@H]1Cc1ccc(O)c(O)c-31,VMWNQDUVQKEIOC-CYBMUJFWSA-N,6005.0,,VMWNQDUVQKEIOC,0,dopamine receptor agonist,ADRA2A
1,BRD-K76022557-003-28-9,(R)-(-)-apomorphine,Launched,dopamine receptor agonist,ADRA2A|ADRA2B|ADRA2C|CALY|DRD1|DRD2|DRD3|DRD4|...,neurology/psychiatry,Parkinson's Disease,0,98.9,MedChemEx,...,Apomorphine (hydrochloride hemihydrate),267.126,CN1CCc2cccc-3c2[C@H]1Cc1ccc(O)c(O)c-31,VMWNQDUVQKEIOC-CYBMUJFWSA-N,6005.0,,VMWNQDUVQKEIOC,0,dopamine receptor agonist,ADRA2B
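In the table above, `target` holds a pipe-delimited list of targets, while `target_unique` holds a single target per row. The sketch below shows one way such a long format can be derived with pandas. It uses a toy dataframe with made-up compound names, and it is an assumption about how a table like this could be built, not the code used to produce the upstream file.

```python
import pandas as pd

# Toy data mimicking the pipe-delimited `target` column (compound names are hypothetical)
toy_df = pd.DataFrame(
    {
        "pert_iname": ["cpd_a", "cpd_b"],
        "target": ["ADRA2A|ADRA2B|DRD1", "TOP1"],
    }
)

# Split each target string on "|" and emit one row per individual target
long_df = (
    toy_df
    .assign(target_unique=toy_df["target"].str.split("|"))
    .explode("target_unique")
    .reset_index(drop=True)
)

print(long_df)
```

Each compound now appears once per target, which is why the long file has many more rows than there are compounds.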


In [3]:
# Load moa file
moa_file = pathlib.Path("data", "split_moas_cpds.csv")
moa_df = pd.read_csv(moa_file)

print(moa_df.shape)
moa_df.head()

(1571, 5)


Unnamed: 0,pert_iname,moa,train,test,marked
0,ketoprofen,cyclooxygenase inhibitor,True,False,True
1,valdecoxib,cyclooxygenase inhibitor,False,True,True
2,epirizole,cyclooxygenase inhibitor,True,False,True
3,ketorolac,cyclooxygenase inhibitor,True,False,True
4,balsalazide,cyclooxygenase inhibitor,True,False,True


In [4]:
# Note: this long dataframe has one row per compound-MOA pair,
# so compounds with multiple MOAs appear in more than one row
moa_df.pert_iname.value_counts()

ursolic-acid                11
bardoxolone-methyl          10
ellagic-acid                 7
ginkgolide-b                 7
betulinic-acid               7
                            ..
rotundine                    1
bemegride                    1
ammonium-glycyrrhizinate     1
sorbinil                     1
dilazep                      1
Name: pert_iname, Length: 1258, dtype: int64

In [5]:
# Subset the target info to the columns needed for the merge
# (.copy() avoids modifying a view of target_df in the assignment below)
target_subset_df = target_df.loc[
    :,
    ["pert_iname", "moa", "target_unique", "clinical_phase", "disease_area", "indication"]
].copy()

# Lowercase the MOA labels to match the moa dataframe
target_subset_df["moa"] = target_subset_df["moa"].astype(str).str.lower()

target_subset_df

Unnamed: 0,pert_iname,moa,target_unique,clinical_phase,disease_area,indication
0,(R)-(-)-apomorphine,dopamine receptor agonist,ADRA2A,Launched,neurology/psychiatry,Parkinson's Disease
1,(R)-(-)-apomorphine,dopamine receptor agonist,ADRA2B,Launched,neurology/psychiatry,Parkinson's Disease
2,(R)-(-)-apomorphine,dopamine receptor agonist,ADRA2C,Launched,neurology/psychiatry,Parkinson's Disease
3,(R)-(-)-apomorphine,dopamine receptor agonist,CALY,Launched,neurology/psychiatry,Parkinson's Disease
4,(R)-(-)-apomorphine,dopamine receptor agonist,DRD1,Launched,neurology/psychiatry,Parkinson's Disease
...,...,...,...,...,...,...
39466,9-aminocamptothecin,topoisomerase inhibitor,TOP1,Phase 2,,
39467,9-anthracenecarboxylic-acid,,ANO1,Preclinical,,
39468,9-anthracenecarboxylic-acid,,CLCN1,Preclinical,,
39469,9-anthracenecarboxylic-acid,,ANO1,Preclinical,,


In [6]:
moa_target_df = (
    moa_df
    .merge(
        target_subset_df,
        on=["pert_iname", "moa"],
        how="left"
    )
    .drop_duplicates()
    .reset_index(drop=True)
)

print(moa_target_df.shape)
moa_target_df.head()

(3178, 9)


Unnamed: 0,pert_iname,moa,train,test,marked,target_unique,clinical_phase,disease_area,indication
0,ketoprofen,cyclooxygenase inhibitor,True,False,True,PTGS1,Launched,rheumatology,rheumatoid arthritis|osteoarthritis
1,ketoprofen,cyclooxygenase inhibitor,True,False,True,PTGS2,Launched,rheumatology,rheumatoid arthritis|osteoarthritis
2,ketoprofen,cyclooxygenase inhibitor,True,False,True,SLC5A8,Launched,rheumatology,rheumatoid arthritis|osteoarthritis
3,valdecoxib,cyclooxygenase inhibitor,False,True,True,CA12,Withdrawn,,
4,valdecoxib,cyclooxygenase inhibitor,False,True,True,PTGS2,Withdrawn,,
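Because this is a left merge, compound-MOA pairs without a matching target annotation are kept with missing values in the target columns. A minimal sketch of how such unmatched rows could be identified, using the `indicator` option of `DataFrame.merge` on toy data (the compound, MOA, and target names here are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the MOA and target dataframes
left = pd.DataFrame(
    {"pert_iname": ["cpd_a", "cpd_b"], "moa": ["moa_x", "moa_y"]}
)
right = pd.DataFrame(
    {
        "pert_iname": ["cpd_a", "cpd_a"],
        "moa": ["moa_x", "moa_x"],
        "target_unique": ["T1", "T2"],
    }
)

# indicator=True adds a `_merge` column recording where each row came from
merged = left.merge(right, on=["pert_iname", "moa"], how="left", indicator=True)

# Rows flagged "left_only" had no target annotation for that compound-MOA pair
unmatched = merged.loc[merged["_merge"] == "left_only", "pert_iname"].unique().tolist()

print(merged)
print(unmatched)
```

Running the same check on the real dataframes would show how many compounds enter the downstream pathway mapping without any target.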


In [7]:
# Make sure no perturbations have been dropped
assert moa_target_df.pert_iname.nunique() == moa_df.pert_iname.nunique()

In [8]:
len(moa_target_df.moa.unique())

501

In [9]:
len(moa_target_df.target_unique.unique())

744
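One caveat when reading these counts: `len(series.unique())` counts a missing value as its own entry, whereas `series.nunique()` ignores missing values by default. The output above does not show whether the target count includes a missing-value entry, so the toy example below only illustrates the difference between the two approaches.

```python
import numpy as np
import pandas as pd

# Toy target column with one missing annotation
targets = pd.Series(["PTGS1", "PTGS2", np.nan, "PTGS2"], name="target_unique")

# unique() keeps NaN as a distinct value
n_with_nan = len(targets.unique())

# nunique() drops NaN by default (dropna=True)
n_without_nan = targets.nunique()

print(n_with_nan, n_without_nan)
```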

In [10]:
# Output file for pathway mapping
output_file = pathlib.Path("data", "split_moas_targets_cpds.csv")
moa_target_df.to_csv(output_file, index=False)
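The pathway mapping itself happens downstream of this file. As a rough sketch of what that step could look like, the example below joins a compound-target table to a gene-to-pathway table. The gene names, pathway names, and the `gene` and `pathway` column names are all hypothetical placeholders. The actual pathway resource and its format are not specified here.

```python
import pandas as pd

# Hypothetical compound-target table (a subset of the columns written above)
cpd_target_df = pd.DataFrame(
    {
        "pert_iname": ["cpd_a", "cpd_a", "cpd_b"],
        "target_unique": ["GENE1", "GENE2", "GENE3"],
    }
)

# Hypothetical gene-to-pathway table; a gene may belong to several pathways
pathway_df = pd.DataFrame(
    {
        "gene": ["GENE1", "GENE1", "GENE2"],
        "pathway": ["pathway_A", "pathway_B", "pathway_A"],
    }
)

# Left merge keeps targets without any pathway annotation (pathway is NaN)
cpd_pathway_df = (
    cpd_target_df
    .merge(pathway_df, left_on="target_unique", right_on="gene", how="left")
    .drop(columns="gene")
    .drop_duplicates()
    .reset_index(drop=True)
)

print(cpd_pathway_df)
```

As with the MOA-to-target merge, a left join keeps every compound in the table even when one of its targets has no pathway annotation.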