GitHub - Zhaonan99/NVDT: predict protein-protein interaction and non-interaction

NVDT
A new computational method for protein-protein interaction and non-interaction predictions from the gene sequence


Part 1：Datasets
---Positive and negative samples of each species.

Each .xlsx file has two columns and several rows. Each row represents one protein pair. Positive samples are interacting protein pairs; negative samples are non-interacting protein pairs.
(1) Real-Datasets: Positive samples are collected from the DIP Database, and negative samples are collected from the Negatome Database. There are two species here, H. sapiens and M. musculus. 
The name of the positive samples dataset file should be the name of your species followed by '_Positive_Real' (e.g., M. musculus_Positive_Real.xlsx).
The name of the negative samples dataset file should be the name of your species followed by '_Negative_Real' (e.g., M. musculus_Negative_Real.xlsx).
(2) Constructed-Datasets: Positive samples are collected from the DIP Database, and negative samples are constructed by pairing the proteins located in different subcellular positions.
The name of the positive samples dataset file should be the name of your species followed by '_Positive_Constructed' (e.g., S.cerevisiae_Positive_Constructed.xlsx).
The name of the negative samples dataset file should be the name of your species followed by '_Negative_Constructed' (e.g., S.cerevisiae_Negative_Constructed.xlsx).
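
The naming convention above can be written down as a small helper. This is only a sketch: the function name 'dataset_filename' is hypothetical and is not part of the repository.

```python
def dataset_filename(species: str, label: str, source: str) -> str:
    """Build a dataset file name of the form <species>_<label>_<source>.xlsx."""
    if label not in ("Positive", "Negative"):
        raise ValueError("label must be 'Positive' or 'Negative'")
    if source not in ("Real", "Constructed"):
        raise ValueError("source must be 'Real' or 'Constructed'")
    return f"{species}_{label}_{source}.xlsx"


if __name__ == "__main__":
    print(dataset_filename("M. musculus", "Negative", "Real"))
    print(dataset_filename("S.cerevisiae", "Positive", "Constructed"))
```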



Part 2：Feature extraction and standardization 
--- Sequence features are extracted from protein pairs and then standardized.

Run file 'Feature_extraction.m' to extract the features of protein pairs in the training dataset and testing dataset.
Run file 'Feature_standardize.ipynb' to standardize the protein pairs' feature vectors in the training dataset and the testing dataset.
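
This README does not state which standardization 'Feature_standardize.ipynb' applies. As an illustration only, the sketch below assumes z-score standardization (zero mean, unit variance per feature), with the statistics computed on the training dataset and reused for the testing dataset. The function name 'standardize' is hypothetical.

```python
from statistics import mean, pstdev


def standardize(train, test):
    """Z-score each feature column using training-set statistics only."""
    columns = list(zip(*train))
    means = [mean(col) for col in columns]
    # A constant feature has zero deviation; divide by 1.0 instead of 0.
    stds = [pstdev(col) or 1.0 for col in columns]

    def scale(rows):
        return [[(x - m) / s for x, m, s in zip(row, means, stds)]
                for row in rows]

    return scale(train), scale(test)


if __name__ == "__main__":
    train_z, test_z = standardize([[1.0, 2.0], [3.0, 2.0]], [[2.0, 4.0]])
    print(train_z)
    print(test_z)
```

Taking the mean and deviation from the training dataset alone keeps information from the testing dataset out of the model inputs.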



Part 3：Training datasets and testing datasets 
--- Files that store the features and labels of protein pairs in the training dataset and the testing dataset of each species.

Necessary input files before running the command:
(1) Input file for the random forest classifier: It is in .xlsx format. (This only works on Real-Dataset.)
The Excel workbook contains four sheets, which store the training dataset feature vectors, the training dataset labels, the testing dataset feature vectors, and the testing dataset labels, respectively.
The name of this file should be the name of your species and data type followed by '_RFInput' (e.g., H. sapien_Real_RFInput.xlsx and H. sapien_Constructed_RFInput.xlsx).
(2) Input files for the support vector machine classifier: They are in .txt format. (This works on both Real-Dataset and Constructed-Dataset.)
-Training dataset: The name of this file should be the name of your species followed by '_Train' (e.g., S.cerevisiae_Train.txt).
Suppose that the training dataset has M samples and each sample has N features. The file has the following form:
[label 1] [1: feature 1] [2: feature 2] … [N: feature N]
[label 2] [1: feature 1] [2: feature 2] … [N: feature N]
…
[label M] [1: feature 1] [2: feature 2] … [N: feature N]
The .txt file is generated with a file named 'FormatDataLibsvm.xls'.
-Testing dataset: Same format as the training dataset. The name of this file should be the name of your species followed by '_Test' (e.g., S.cerevisiae_Test.txt).
As with the training dataset, you can generate the .txt file using 'FormatDataLibsvm.xls'.
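
As an alternative illustration of the layout above, the sketch below builds the same kind of lines in Python. In an actual LIBSVM file the square brackets are not written: each line is the label followed by space-separated index:value pairs, with indices starting at 1. The function names and the label values in the example are hypothetical; the repository itself uses 'FormatDataLibsvm.xls' for this step.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> 1:<f1> 2:<f2> ... N:<fN>'."""
    parts = [str(label)]
    parts += [f"{index}:{value}" for index, value in enumerate(features, start=1)]
    return " ".join(parts)


def to_libsvm(labels, rows):
    """Format M samples, one per line, ready to be written to a .txt file."""
    return "\n".join(to_libsvm_line(l, r) for l, r in zip(labels, rows)) + "\n"


if __name__ == "__main__":
    # Example labels (1 and -1) are placeholders, not taken from the repository.
    print(to_libsvm([1, -1], [[0.5, -1.25, 3.0], [0.0, 2.5, 1.0]]), end="")
```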



Part 4：Training model and prediction 

(1) Random forest classifier: Run file 'RF_classifier.ipynb'.
(2) Support vector machine classifier:
Install LIBSVM. ( LIBSVM download address: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ )
svm-train.exe: takes the training dataset as input and generates a model.
Usage: svm-train.exe [options] training_dataset_file [model_file]
svm-predict.exe: uses the model to predict the labels of the testing dataset.
Usage: svm-predict.exe [options] test_file model_file output_file
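
The two usage lines above can be assembled into argument lists as sketched below. This README does not say which options were used, so the kernel and cost values here are placeholders: '-t 2' selects the RBF kernel in LIBSVM, '-c' sets the cost parameter, and '-g' sets gamma. The function names and the model and output file names are hypothetical.

```python
def svm_train_command(train_file, model_file, cost=1.0, gamma=None):
    """Argument list for: svm-train.exe [options] training_dataset_file [model_file]."""
    cmd = ["svm-train.exe", "-t", "2", "-c", str(cost)]
    if gamma is not None:
        cmd += ["-g", str(gamma)]
    return cmd + [train_file, model_file]


def svm_predict_command(test_file, model_file, output_file):
    """Argument list for: svm-predict.exe [options] test_file model_file output_file."""
    return ["svm-predict.exe", test_file, model_file, output_file]


if __name__ == "__main__":
    train = svm_train_command("S.cerevisiae_Train.txt", "S.cerevisiae.model")
    predict = svm_predict_command(
        "S.cerevisiae_Test.txt", "S.cerevisiae.model", "S.cerevisiae_Output.txt"
    )
    print(" ".join(train))
    print(" ".join(predict))
```

Each list can be passed to subprocess.run(cmd, check=True) once LIBSVM is installed and the executables are on the PATH.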



Part 5：Model evaluation 
(1) Summary of prediction results of the random forest classifier
File 'RF_classifier.ipynb' contains the calculation of each evaluation index.
(2) Summary of prediction results of the support vector machine classifier
Run files 'AUC_value.m' and 'prediction_index.m'.
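
This README does not list which indices the scripts compute. As an illustration, the sketch below computes indices commonly reported for binary classifiers (accuracy, sensitivity, specificity, precision, and the Matthews correlation coefficient) from true and predicted labels. The function name 'evaluate' and the assumption that the positive label is 1 are hypothetical and not taken from the repository.

```python
from math import sqrt


def evaluate(y_true, y_pred, positive=1):
    """Compute common binary classification indices from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is absent.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    print(evaluate([1, 1, 1, -1, -1, -1], [1, 1, -1, -1, -1, 1]))
```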


Part 6：Independent cross-species datasets
(1) Training dataset: all S. cerevisiae samples.
(2) Testing datasets: the other five species in the DIP database, namely D. melanogaster, H. pylori, Caenorhabditis elegans, M. musculus and Escherichia coli.