HKRCPI: Heterogeneous Kernel Representation for Compound-Protein Interaction Prediction
A heterogeneous kernel-based method that predicts whether a compound-protein pair is interacting or non-interacting. The model captures protein features and a circular fingerprint representation of the compound's SMILES.
Abstract
Motivation: Machine learning based prediction of compound-protein interactions (CPIs) is important for drug design, screening and repurposing studies, and can improve the efficiency and cost-effectiveness of wet lab assays. Despite the publication of many research papers reporting CPI predictors in recent years, we have observed a number of fundamental issues in experiment design that lead to overoptimistic estimates of model performance.
Results: In this paper, we analyze the impact of several important factors affecting the generalization performance of CPI predictors that are overlooked in existing work:
- Similarity between training and test examples in cross-validation
- The strategy for generating negative examples, in the absence of experimentally verified negative examples.
- Choice of evaluation protocols and performance metrics and their alignment with real-world use of CPI predictors in screening large compound libraries.

Using both an existing state-of-the-art method (CPI-NN) and a proposed kernel-based approach, we have found that assessing the predictive performance of CPI predictors requires careful control over the similarity between training and test examples. We also show that random pairing for generating synthetic negative examples for training and performance evaluation results in models with better generalization performance than the more sophisticated strategies used in existing studies. Furthermore, we have found that our kernel-based approach, despite its simple design, exceeds the prediction performance of CPI-NN. We have used the proposed model for compound screening of several proteins, including the SARS-CoV-2 Spike and human ACE2 proteins, and found strong evidence in support of its top hits.
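The random-pairing strategy for synthetic negatives mentioned in the abstract can be illustrated with a minimal sketch. It assumes positives are given as (compound, protein) tuples and that a negative is any random compound-protein combination not listed as a positive; the function name, the `ratio` parameter, and the seed handling are hypothetical and are not taken from the repository code.

```python
import random


def random_negative_pairs(positives, ratio=1, seed=0):
    """Sample up to ratio * len(positives) random pairs that are not known positives.

    Illustrative only: this is an assumed reading of "random pairing",
    not the repository's implementation.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    compounds = sorted({compound for compound, _ in positives})
    proteins = sorted({protein for _, protein in positives})

    # Never ask for more negatives than there are unlabeled combinations.
    available = len(compounds) * len(proteins) - len(positive_set)
    target = min(ratio * len(positives), available)

    negatives = set()
    while len(negatives) < target:
        pair = (rng.choice(compounds), rng.choice(proteins))
        if pair not in positive_set:
            negatives.add(pair)
    return sorted(negatives)
```

For example, `random_negative_pairs(positives, ratio=3)` would return up to three synthetic negatives per positive pair.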
python=3.6
conda install -c conda-forge rdkit
NR-HCPI Dataset can be downloaded from this link [https://github.com/adibayaseen/HKRCPI/tree/main/Datasets/NR-HCPI]
The external dataset can be downloaded from here: https://github.com/adibayaseen/HKRCPI/blob/main/Datasets/BindingDB_m62021_top4000_1783000nM.txt
The SuperDRUG2 dataset can be downloaded from here: https://github.com/adibayaseen/HKRCPI/blob/main/Datasets/approved_drugs_chemical_structure_identifiers.xlsx
- (a) Selection criteria applied to BindingDB for generating the negative dataset can be downloaded from this link [https://github.com/adibayaseen/HKRCPI/blob/25f5d4426f01806da123093685fe4632624f3319/Selection%20criteria%20applied%20to%20Binding%20DB%20for%20generating%20the%20nega-tive%20dataset.docx]
- (b) Comparative results for different Kernels can be downloaded from this link [https://github.com/adibayaseen/HKRCPI/blob/5b07821293317a980e3989e281594a4a9f367d53/Table%201%20comparative%20results%20for%20different%20Kernels.docx]
- (c) Experimental setup for screening with Non-redundant Cross-validation (NRCV) or Screening SuperDRUG2 can be downloaded from this link [https://github.com/adibayaseen/HKRCPI/blob/5b07821293317a980e3989e281594a4a9f367d53/Experimental%20setup%20for%20screening%20with%20Non-redundant%20Cross-validation%20(NRCV)%20or%20Screening%20SuperDRUG2.docx]
- (d) Experimental setup for RFPP calculation can be downloaded from this link [https://github.com/adibayaseen/HKRCPI/blob/5b07821293317a980e3989e281594a4a9f367d53/Experimental%20setup%20for%20RFPP%20calculation.docx]
- (e) Comparison of non-redundant cross-validation (CV) results of our proposed model with previous method CPI-NN using 40% redundancy removal can be downloaded from this link [https://github.com/adibayaseen/HKRCPI/blob/5b07821293317a980e3989e281594a4a9f367d53/NRCV%2040%20percent.docx]
- (f) Target Compound Screening (TCS) for drug repurposing: a comparison of RFPP on the test set for CPI-NN and our proposed model can be downloaded from this link [https://github.com/adibayaseen/HKRCPI/blob/5b07821293317a980e3989e281594a4a9f367d53/Drugreprurposing%20scores%20for%20all%20pairs%20of%20proteins%20in%20the%20test%20set%20for%20one%20fold.docx]
- (g) Target Compound Screening (TCS) for drug repurposing: scores for all pairs of proteins in the test set for one fold can be downloaded from this link [https://docs.google.com/spreadsheets/d/1480q6VmhbPlBet1JrMydTb5p_Mp_zYJm/edit?usp=sharing&ouid=108408133666428329638&rtpof=true&sd=true]
- (h) RFPP of TCS proteins in the test fold (the RFPP of the proteins whose all-pair scores are listed in the previous file) can be downloaded from this link [https://github.com/adibayaseen/HKRCPI/blob/1f91d7b87a8fd910e03d4971dc055c310f230ae4/Target%20Compound%20Screening%20(TCS)%20RFPP%20of%20a%20proteins%20.xlsx]
- (i) SARS-CoV-2 results and supporting evidence can be downloaded from this link [https://github.com/adibayaseen/HKRCPI/blob/5b07821293317a980e3989e281594a4a9f367d53/ACE2%20%20and%20Spike%20Target%20Compound%20Screening%20(TCS)%20Results%20with%20supporting%20evidence.docx]
- (j) SARS-CoV-2 results with the top 100 predictions from our model can be downloaded from this link [https://github.com/adibayaseen/HKRCPI/blob/5b07821293317a980e3989e281594a4a9f367d53/ACE2%20and%20Spike%20Top%20100%20predictions.docx]
- (k) SARS-CoV-2 median ranks for the ACE2 and Spike proteins, along with the names of the top drugs predicted by our model, can be downloaded from this link [https://github.com/adibayaseen/HKRCPI/blob/main/ACE2%20and%20Spike%20Median%20Results%20Excluding%20Lightweight%20Compounds.xlsx]
Proposed model: used to generate predictions for a test file in the given format.
Baseline SVM: used for the baseline results.
Product Kernel SVM: used for the product-kernel experimental setup.
Validation over BindingDB: used for validation on experimentally verified negative examples from BindingDB.
Dissimilarity Controlled Negative Examples generation: generation of dissimilarity-controlled negative examples, together with the proposed method.
RFPP Screening with Non-redundant Cross-validation (NRCV): we train a model on the training folds of the NRCV dataset, compute prediction scores for all-vs-all compound-protein pairs in the test fold with the trained model (see the supplementary information file for an illustration), and calculate the rank of the first positive pair (RFPP).
RFPP with SuperDrug2: used for drug-repurposing analysis with the proposed model on the SuperDRUG2 dataset, which contains 3,633 FDA-approved drugs. In this experiment, a CPI model is first trained on all examples in the training folds of the NRCV dataset and then used to generate prediction scores for all proteins in the test fold paired with all compounds in the SuperDRUG2 database (see the supplementary material for an illustration of the experimental setup).
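The rank of the first positive pair (RFPP) used in the two screening setups above can be illustrated with a minimal sketch. It assumes that higher scores mean stronger predicted interaction, that positives carry the label 1, and that RFPP is computed per protein over all compounds scored against it; the function names and the record layout are hypothetical and not taken from the repository.

```python
from collections import defaultdict


def rank_of_first_positive(scores, labels):
    """Return the 1-based rank of the best-scoring positive, or None if none exists."""
    # Sort indices by descending score; ties keep their input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, index in enumerate(order, start=1):
        if labels[index] == 1:
            return rank
    return None


def rfpp_by_protein(records):
    """Group (protein, score, label) records by protein and compute RFPP for each."""
    grouped = defaultdict(lambda: ([], []))
    for protein, score, label in records:
        grouped[protein][0].append(score)
        grouped[protein][1].append(label)
    return {
        protein: rank_of_first_positive(scores, labels)
        for protein, (scores, labels) in grouped.items()
    }
```

An RFPP of 1 means the top-ranked compound for that protein is a known binder; larger values mean more non-interacting compounds were ranked above the first true hit.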
- Input File format
Columns: SMILES of compound, protein sequence, label
Sample input data
Each line of the file should have the form Compound<space>Protein<space>label<newline>
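A file in this format can be read with a few lines of Python. The sketch below is only an illustration of the format: it assumes the three fields are separated by whitespace and that the label is an integer; `parse_pairs` is a hypothetical helper, not a function from the repository.

```python
def parse_pairs(lines):
    """Parse 'SMILES<space>sequence<space>label' lines into (smiles, sequence, label) tuples."""
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        smiles, sequence, label = fields
        # Assumption: labels are integers.
        pairs.append((smiles, sequence, int(label)))
    return pairs
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `parse_pairs(open('TestHKRCPI.txt'))`.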
Genearate_Prediction: used to generate predictions for a test file in the given format.
NegtiveRatio='7'
path='/content/drive/MyDrive/CPI_Data/'
- Select NegtiveRatio from 1, 3, 5, and 7, and set path to the location of the dataset (and of the SuperDRUG2 file in the case of TCS, i.e. top predictions for a given protein sequence).
If pairs are provided in the sample format shown above, prediction scores from our model can be generated and saved as an Excel file in the given path.
Testfilename ='TestHKRCPI'
testscore, testP,testC,testY =PredictscorefromTestPairFile(path,Testfilename+'.txt',Ptr, Ctr,Pscaler,Cscaler)
- Use prediction from a pair file if you have compound-protein pairs; use TCS with SuperDRUG2 if you have only one protein and want to see its top predictions.
DataWriteTestfilepair(Testfile,NegtiveRatio,len(testscore),f,testP,testC,testscore)
- Results can be written to an Excel file for every fold.
Specify s, sName and n (s is the protein sequence and n is the number of top predictions you want to see); the call returns the sorted SuperdrugNames and sortedscore.
sName='mytest'
n=100
s='MSSSSWLLLSLVAVTAAQSTIEEQTSF'
Topdrugs, Topdrugscores=PredictTopNscores(path,Ptr, Ctr,s,n)
DataWriteTestfilepair(Testfile,NegtiveRatio,n,f,s,Topdrugs,Topdrugscores)
- Results can be written to an Excel file for every fold.
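The ranking step behind a top-n query can be sketched as follows. This is only an illustration of sorting drug names by score and keeping the first n; it assumes higher scores are better, and `top_n` is a hypothetical helper rather than the repository's `PredictTopNscores`.

```python
def top_n(names, scores, n):
    """Return the n highest-scoring names and their scores, best first."""
    ranked = sorted(zip(names, scores), key=lambda item: item[1], reverse=True)[:n]
    top_names = [name for name, _ in ranked]
    top_scores = [score for _, score in ranked]
    return top_names, top_scores
```

If n exceeds the number of scored drugs, the sketch simply returns all of them in ranked order.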