Prokaryotic virus Host Predictor (PHP) is a computational tool that predicts the hosts of prokaryotic viruses using a Gaussian Model (GM).
PHP takes complete or partial genomic sequences of prokaryotic viruses as input. For each virus sequence,
PHP automatically calculates the probability of being the host for each of 60,105 prokaryotic genomes
and takes the prokaryotic genome with the largest probability as the predicted host.
PHP outputs the name of the prokaryotic genome with the largest score, together with the host scores of all prokaryotic genomes.
The Python packages pandas, numpy, and scikit-learn (sklearn) must be installed before installing phageHostPredictor. The program must be run on the Linux operating system.
pip install pandas
pip install numpy
pip install -U scikit-learn
python3 countKmer.py --fastaFileDir ./exampleHostGenome --kmerFileDir ./exampleOutput --kmerName HostKmer --coreNum -1
Or use the short-form options:
python3 countKmer.py -f ./exampleHostGenome -d ./exampleOutput -n HostKmer -c -1
--fastaFileDir or -f: The directory containing the FASTA files of prokaryotic genome sequences, one genome per file.
--kmerFileDir or -d: The path where the prokaryotic k-mer file is saved.
--kmerName or -n: The name of the prokaryotic k-mer file.
--coreNum or -c: The number of cores used for the k-mer calculation. -1 uses all cores.
* The k-mer file for the 60,105 prokaryotic genomes is saved in the current folder and is named hostKmer_60105_kmer4.tar.gz.
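To illustrate what a k-mer file summarizes, here is a minimal sketch of k-mer frequency counting for a single sequence. It is not the code in countKmer.py: the function name kmer_frequencies is hypothetical, k = 4 is an assumption based only on the "kmer4" part of the archive name, and whether PHP also counts reverse complements or handles ambiguous bases this way is not stated here.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the relative frequency of every DNA k-mer in `sequence`.

    Hypothetical helper for illustration only. Windows that contain
    characters other than A, C, G, T (for example N) are skipped.
    """
    alphabet = "ACGT"
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if all(base in alphabet for base in kmer):
            counts[kmer] += 1
    total = sum(counts.values())
    # Emit all 4**k k-mers in a fixed order so that the vectors of
    # different genomes are directly comparable.
    all_kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


freqs = kmer_frequencies("ACGTACGT", k=4)
print(len(freqs), freqs["ACGT"])
```

For real genomes the sequence would be read from a FASTA file first; that step is omitted to keep the sketch short.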
python3 PHP.py --virusFastaFileDir ./exampleVirusGenome --outFileDir ./exampleOutput --bacteriaKmerDir ./exampleOutput --bacteriaKmerName HostKmer
Or use the short-form options:
python3 PHP.py -v ./exampleVirusGenome -o ./exampleOutput -d ./exampleOutput -n HostKmer
--virusFastaFileDir or -v: The directory containing the FASTA files of query virus sequences, one virus genome per file.
--outFileDir or -o: The path for temporary files and result files.
--bacteriaKmerDir or -d: The path where the prokaryotic k-mer file is saved.
--bacteriaKmerName or -n: The name of the prokaryotic k-mer file.
After running the prediction program, you will see the output files Prediction_Maxhost and Prediction_Allhost in the output folder.
In Prediction_Maxhost.tsv, the first column is the input virus, the second column is the score of the highest-scoring host, and the third column is that host.
In Prediction_Allhost.csv, the query viruses and their scores for all prokaryotic genomes are given.
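The column layout of Prediction_Maxhost.tsv can be read programmatically. The following is a minimal sketch, assuming the file is tab-separated with the columns in the order virus, score, host. Whether the real file has a header row is not stated here, so has_header is a parameter; the function name read_max_host, the min_score filter, and the sample names phageA and hostX are all hypothetical.

```python
import csv
import io


def read_max_host(handle, has_header=True, min_score=None):
    """Yield (virus, host, score) tuples from a Prediction_Maxhost-style table.

    Assumes tab-separated columns in the order: virus, score, host.
    Rows scoring below `min_score` are dropped when it is given.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the assumed header row
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or truncated lines
        virus, score, host = row[0], float(row[1]), row[2]
        if min_score is None or score >= min_score:
            yield virus, host, score


# Made-up example data standing in for a real result file.
sample = io.StringIO("virus\tscore\thost\nphageA\t1445.2\thostX\n")
for virus, host, score in read_max_host(sample):
    print(virus, host, score)
```

With a real result file, pass an open file handle (opened with newline="") in place of the io.StringIO object.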
The corresponding accuracy of different scoring thresholds is as follows.
Users can train customized models on their own data. The /exampleTrainingData/ folder provides sample data; each subfolder contains one pair of virus and host genomes.
Within each pair folder, the virus must be saved in the /phage/ folder and the host in the /host/ folder.
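Before training, it can help to check that every pair folder follows this layout. The sketch below is not part of PHP; find_incomplete_pairs is a hypothetical helper that assumes the subfolders are named exactly phage and host, as described above.

```python
from pathlib import Path


def find_incomplete_pairs(train_dir):
    """Return (pair_name, missing_subfolders) for pair folders that lack
    a phage/ or host/ subfolder. An empty list means the layout looks fine.
    """
    problems = []
    for pair in sorted(Path(train_dir).iterdir()):
        if not pair.is_dir():
            continue  # stray files at the top level are ignored
        missing = [name for name in ("phage", "host") if not (pair / name).is_dir()]
        if missing:
            problems.append((pair.name, missing))
    return problems
```

For example, find_incomplete_pairs("./exampleTrainingData") would return an empty list if every pair folder contains both subfolders.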
Then run the following command to train the model:
python3 PHP_UserTrain.py --trainDataDir ./exampleTrainingData --outFileDir ./exampleOutput --modelName UserModel.m
--trainDataDir: The training data folder, in which each pair of training data is saved in its own folder.
--outFileDir: The path where the trained model is saved.
--modelName: The name of the trained model.
To use the customized model, rename it and use it to replace the original PHP model.
If your input contains sequences of 12500 bp or shorter, the program automatically identifies these short segments and uses the corresponding model for prediction:
if length > 12500 bp, the sequence is predicted using the full-length model;
if 7500 bp < length <= 12500 bp, the sequence is predicted using the 10000 bp model;
if 4000 bp < length <= 7500 bp, the sequence is predicted using the 5000 bp model;
if 2000 bp < length <= 4000 bp, the sequence is predicted using the 3000 bp model;
if length <= 2000 bp, the sequence is predicted using the 1000 bp model.
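The length rules above can be sketched as a small lookup function. This only mirrors the stated thresholds; the function name select_model and the returned labels are hypothetical and do not correspond to actual model file names in PHP.

```python
def select_model(length_bp):
    """Return a label for the model that matches a sequence length in bp.

    The labels are illustrative placeholders, not PHP file names.
    """
    if length_bp <= 0:
        raise ValueError("sequence length must be positive")
    if length_bp > 12500:
        return "full-length"
    if length_bp > 7500:
        return "10000bp"
    if length_bp > 4000:
        return "5000bp"
    if length_bp > 2000:
        return "3000bp"
    return "1000bp"


for length in (30000, 12500, 7500, 4000, 2000):
    print(length, select_model(length))
```

Note that each upper bound is inclusive, so a sequence of exactly 12500 bp falls under the 10000 bp model rather than the full-length one.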