adibayaseen/HKRCPI


Insights into performance evaluation of compound-protein interaction prediction methods

HKRCPI

HKRCPI: Heterogeneous Kernel Representation for Compound-Protein Interaction Prediction. HKRCPI is a heterogeneous kernel-based method that predicts whether a compound-protein pair is interacting or non-interacting. The model combines protein features with a circular fingerprint representation of the compound's SMILES string.

Abstract

Motivation: Machine learning based prediction of compound-protein interactions (CPIs) is important for drug design, screening and repurposing studies and can improve the efficiency and cost-effectiveness of wet lab assays. Despite the publication of many research papers reporting CPI predictors in recent years, we have observed a number of fundamental issues in experiment design that lead to overoptimistic estimates of model performance.

Results: In this paper, we analyze the impact of several important factors affecting the generalization performance of CPI predictors that are overlooked in existing work:

  1. Similarity between training and test examples in cross-validation.
  2. The strategy for generating negative examples, in the absence of experimentally verified negative examples.
  3. Choice of evaluation protocols and performance metrics and their alignment with real-world use of CPI predictors in screening large compound libraries.

Using both an existing state-of-the-art method (CPI-NN) and a proposed kernel-based approach, we have found that assessment of predictive performance of CPI predictors requires careful control over similarity between training and test examples. We also show that random pairing for generating synthetic negative examples for training and performance evaluation results in models with better generalization performance in comparison to more sophisticated strategies used in existing studies. Furthermore, we have found that our kernel-based approach, despite its simple design, exceeds the prediction performance of CPI-NN. We have used the proposed model for compound screening of several proteins including SARS-CoV-2 Spike and Human ACE2 proteins and found strong evidence in support of its top hits.

Set Up Environment

python=3.6
conda install -c conda-forge rdkit

Dataset

The NR-HCPI dataset can be downloaded from https://github.com/adibayaseen/HKRCPI/tree/main/Datasets/NR-HCPI
The external (BindingDB) dataset is available at https://github.com/adibayaseen/HKRCPI/blob/main/Datasets/BindingDB_m62021_top4000_1783000nM.txt
The SuperDRUG2 dataset is available at https://github.com/adibayaseen/HKRCPI/blob/main/Datasets/approved_drugs_chemical_structure_identifiers.xlsx

Supplementary Data

Code Structure

Proposed model: used for prediction of a test file in the given format.
Baseline SVM: used for baseline results.
Product Kernel SVM: used for the experimental setup of the product kernel.
Validation over BindingDB: used for validation over experimentally verified negative examples from BindingDB.
Dissimilarity Controlled Negative Examples generation: dissimilarity-controlled generation of negative examples, used together with the proposed method.
RFPP Screening with Non-redundant Cross-validation (NRCV): trains a model on the training folds of the NRCV dataset, computes prediction scores for all-vs-all compound-protein pairs in the test fold with the trained model (see the supplementary information file for an illustration), and calculates the rank of the first positive pair (RFPP).
RFPP with SuperDrug2: used for drug-repurposing analysis with the proposed model on the SuperDRUG2 dataset, which contains 3,633 FDA-approved drugs. In this experiment, a CPI model is first trained on all examples in the training folds of the NRCV dataset and then used to generate prediction scores for all proteins in the test fold paired with all compounds in the SuperDRUG2 database (see the supplementary material for an illustration of the experimental setup).
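
The rank of first positive pair (RFPP) used in the two screening experiments above can be illustrated with a short sketch. This is not the repository's implementation: the function name `rfpp_per_protein`, the tuple layout, the per-protein grouping, and the pessimistic tie-breaking are all assumptions made for illustration.

```python
from collections import defaultdict


def rfpp_per_protein(scored_pairs):
    """Return {protein_id: rank of its first positive pair}, where rank 1 is best.

    scored_pairs: iterable of (protein_id, compound_id, score, label) tuples,
    with label 1 for a known interacting pair and 0 otherwise (assumed encoding).
    Proteins without any positive pair are left out of the result.
    """
    by_protein = defaultdict(list)
    for protein, _compound, score, label in scored_pairs:
        by_protein[protein].append((score, label))

    ranks = {}
    for protein, items in by_protein.items():
        # Highest score first; on ties, negatives come first so the
        # reported rank is pessimistic rather than optimistic.
        items.sort(key=lambda t: (-t[0], t[1]))
        for rank, (_score, label) in enumerate(items, start=1):
            if label == 1:
                ranks[protein] = rank
                break
    return ranks


# Hypothetical scores for two proteins screened against a few compounds.
pairs = [
    ("P1", "c1", 0.9, 0),
    ("P1", "c2", 0.8, 1),
    ("P1", "c3", 0.1, 0),
    ("P2", "c1", 0.7, 1),
    ("P2", "c2", 0.2, 0),
]
ranks = rfpp_per_protein(pairs)
```

A lower RFPP means fewer compounds have to be examined before the first true binder is found, which is why the metric matches the screening use case described in the abstract.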

Generate predictions

  • Input File format

Columns: SMILES of compound, protein sequence, label
Sample input data

Each line of the file should have the form Compound<space>Protein<space>label<newline>
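
A minimal sketch of reading a file in this format is shown here. The helper names `parse_pair_line` and `read_pairs` are hypothetical and do not come from the repository, and the sketch assumes that labels are integers (for example `1` and `-1`, or `1` and `0`).

```python
def parse_pair_line(line):
    """Split one 'SMILES protein label' line into (smiles, sequence, label)."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 whitespace-separated fields, got {len(parts)}"
        )
    smiles, sequence, label = parts
    # Assumption: the label is an integer class label.
    return smiles, sequence, int(label)


def read_pairs(lines):
    """Parse every non-blank line of an iterable of lines (e.g. an open file)."""
    return [parse_pair_line(line) for line in lines if line.strip()]


# Hypothetical two-line input; a real file would be opened with open(...).
example = ["CCO MKTAYIAK 1\n", "\n", "c1ccccc1 MSSSSWLL -1\n"]
records = read_pairs(example)
```

Because neither SMILES strings nor protein sequences contain whitespace, splitting on whitespace is enough and also tolerates stray extra spaces between fields.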

Genearate_Prediction: used for prediction of a test file in the given format.

NegtiveRatio='7'
path='/content/drive/MyDrive/CPI_Data/'
  • Set NegtiveRatio to 1, 3, 5, or 7, and set path to the location of the dataset and, for TCS (top predictions for a given protein sequence), of the SuperDRUG2 file.
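
To make the role of the negative ratio concrete, here is a hedged sketch of random pairing, the negative-generation strategy named in the abstract. It assumes that the ratio means "negatives per positive" and that negatives are drawn from the compounds and proteins seen in the positive set; the notebook's actual sampling code is not shown in this README, so `sample_random_negatives` and its parameters are hypothetical.

```python
import random


def sample_random_negatives(positives, ratio, seed=0):
    """Draw ratio * len(positives) random (compound, protein) pairs
    that do not appear among the known positives."""
    positives = set(positives)
    compounds = sorted({c for c, _ in positives})
    proteins = sorted({p for _, p in positives})

    target = ratio * len(positives)
    max_possible = len(compounds) * len(proteins) - len(positives)
    if target > max_possible:
        raise ValueError("not enough distinct non-positive pairs for this ratio")

    rng = random.Random(seed)  # fixed seed for reproducible sampling
    negatives = set()
    while len(negatives) < target:
        pair = (rng.choice(compounds), rng.choice(proteins))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)


# Hypothetical positives: three compounds, each paired with one protein.
positives = [("c1", "p1"), ("c2", "p2"), ("c3", "p3")]
negatives = sample_random_negatives(positives, ratio=1)
```

Note that the README sets `NegtiveRatio` as a string (`'7'`), whereas this sketch takes an integer.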

For prediction from a given compound-protein pair:

If the pairs are provided in the sample format above, prediction scores from our model can be generated and saved to the given path as an Excel file.

Testfilename = 'TestHKRCPI'
testscore, testP, testC, testY = PredictscorefromTestPairFile(path, Testfilename + '.txt', Ptr, Ctr, Pscaler, Cscaler)
  • Use prediction from a pair file if you have compound-protein pairs; use TCS with SuperDRUG2 if you have only a protein and want to see its top predictions.
DataWriteTestfilepair(Testfile, NegtiveRatio, len(testscore), f, testP, testC, testscore)
  • Results can be written to an Excel file for every fold.

For TCS:

Specify s, sName and n (s is the protein sequence and n is the number of top predictions that you want to see). PredictTopNscores returns the SuperDRUG2 drug names sorted by score, together with the sorted scores.

sName='mytest'
n=100
s='MSSSSWLLLSLVAVTAAQSTIEEQTSF'
Topdrugs, Topdrugscores=PredictTopNscores(path,Ptr, Ctr,s,n) 
DataWriteTestfilepair(Testfile,NegtiveRatio,n,f,s,Topdrugs,Topdrugscores) 

Results can be written to an Excel file for every fold.
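
The ranking step behind TCS can be sketched independently of the model: given one score per drug, sort in descending order and keep the first n. The function `top_n` below is a hypothetical stand-in for illustration only; it is not `PredictTopNscores`, whose internals are not shown in this README.

```python
def top_n(names, scores, n):
    """Return (names, scores) of the n highest-scoring entries, best first."""
    if len(names) != len(scores):
        raise ValueError("names and scores must have the same length")
    ranked = sorted(zip(names, scores), key=lambda t: t[1], reverse=True)[:n]
    top_names = [name for name, _ in ranked]
    top_scores = [score for _, score in ranked]
    return top_names, top_scores


# Hypothetical scores for three drugs against one protein sequence.
drug_names, drug_scores = top_n(["drugA", "drugB", "drugC"], [0.2, 0.9, 0.5], n=2)
```

Python's `sorted` is stable, so drugs with equal scores keep their original relative order.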
