- Step1: Please analyze `cazy_family_EC.csv` and `cazysubfamily_EC.csv` first. If the family hmm is correct, then keep the family hmm. Otherwise, please discuss it with Dr. Yin.
- Step2: Git clone the repo to your environment.
- Step3: Download the original hmm files to your environment; the folder should be called `ori_hmm_refe_combine`. Please ask Dr. Yin for the original hmm files.
- Step4: Move `examples/clustering/output/dbCAN3_new` and `examples/clustering/output/dbCAN3` from `smallprotein/eCAMI` to your environment.
- Step5: Read the following.
- step0: please read the introduction of eCAMI first.
- step1: create a Python execution environment based on the Introduction of eCAMI. The output will be in `examples/clustering/output/dbCAN3_new`. Please copy all of it into `examples/clustering/output/dbCAN3` by using this command: `cp -rfp * ../dbCAN3/`
- step2: cluster CAZy families; please specify the CAZy family name in clustering.sh.
  `. clustering.sh`
- step3: build new HMM; please specify the CAZy family name in hmm_analysis.sh.
  `. hmm_analysis.sh`
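Where a shell glob is inconvenient, the copy in step1 can be approximated from Python. This is a minimal sketch rather than part of the repository: the function name `copy_cluster_output` is hypothetical, and it assumes Python 3.8 or newer for the `dirs_exist_ok` argument. Unlike `cp -rfp *`, it also copies hidden files.

```python
import shutil
from pathlib import Path


def copy_cluster_output(src_dir, dst_dir):
    """Copy everything under src_dir into dst_dir, overwriting existing files.

    Roughly mirrors `cp -rfp * ../dbCAN3/` run from inside src_dir:
    shutil.copytree copies recursively and, through its default copy2,
    tries to preserve timestamps and permission bits.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"source folder not found: {src}")
    # dirs_exist_ok=True (Python 3.8+) allows the destination to exist already.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    # Return the files now present in the destination, relative to it.
    return sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())
```

For the layout described above, the call would be `copy_cluster_output("examples/clustering/output/dbCAN3_new", "examples/clustering/output/dbCAN3")`.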
- clustering.sh: a shell script to run `clustering.py` for a specified CAZy family.
- hmm_analysis.sh: a shell script to run the combination of Python scripts that build the HMMs.
- hmmscan_combine.py: a Python script to run hmmscan against the combined hmm files.
- hmmscan_nocombine.py: a Python script to run hmmscan against the non-combined hmm files.
- hmmscan-parser.sh: a shell script to parse the result of hmmscan.
- hmm_maker.py: a Python script to run hmmbuild.
- EZ_analysis.py: a Python script to add EC numbers to the HMMs.
- counter.py: a Python script to create two spreadsheets. One has the columns `cazyfamily,# of proteins,# of eCAMI subfams (exclude unclassified),# of proteins in subfams,# of subfams with EC, # of proteins in subfams with EC`. The other has the columns `CAZy subfam,# of proteins,# of proteins with EC,after hmmsearch # of remaining proteins,after hmmsearch # of remaining protein domains,after the usearch # of remaining proteins for mafft`.
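The family-level table written by counter.py can be summarized with the standard `csv` module. The sketch below is illustrative only: the function name `ec_coverage` is hypothetical, and it assumes the file has a header row using exactly the column names listed above and that the count columns hold integers.

```python
import csv


def ec_coverage(csv_lines):
    """Map each CAZy family to the fraction of its subfamily proteins that have an EC.

    csv_lines is any iterable of text lines, for example an open file handle.
    """
    # skipinitialspace drops the blank after a comma, as in ", # of proteins ...".
    reader = csv.DictReader(csv_lines, skipinitialspace=True)
    coverage = {}
    for row in reader:
        in_subfams = int(row["# of proteins in subfams"])
        with_ec = int(row["# of proteins in subfams with EC"])
        # Guard against families with no proteins assigned to any subfamily.
        coverage[row["cazyfamily"]] = with_ec / in_subfams if in_subfams else 0.0
    return coverage
```

A typical call would open the file with `open("cazy_family_EC.csv", newline="")` and pass the handle to `ec_coverage`.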
- The output of `clustering.sh` is `examples/clustering/output/dbCAN3_new/`. Please copy all of it into `examples/clustering/output/dbCAN3` by using this command: `cp -rfp * ../dbCAN3/`.
- The output of `hmmscan_combine.py` is `hmm_analysis_combinded/cut_domain_seq`.
- The output of `hmm_maker.py` is `hmm_analysis_combinded/hmm`.
- The output of `EZ_analysis.py` is `hmm_refe_combine`.
- The output of `hmmscan_combine.py` is `hmm_analysis_combinded/seq_summary`.
- The output of `counter.py` is `cazy_family_EC.csv` and `cazysubfamily_EC.csv`.
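After a run, it can help to confirm that each script left its output where the list above says it should. The following is a minimal sketch under two assumptions: all paths are relative to one working folder, and folder names are spelled exactly as listed (including `hmm_analysis_combinded`). The names `EXPECTED_OUTPUTS` and `missing_outputs` are hypothetical.

```python
from pathlib import Path

# Script -> output locations, copied from the list above.
EXPECTED_OUTPUTS = {
    "clustering.sh": ["examples/clustering/output/dbCAN3_new"],
    "hmmscan_combine.py": [
        "hmm_analysis_combinded/cut_domain_seq",
        "hmm_analysis_combinded/seq_summary",
    ],
    "hmm_maker.py": ["hmm_analysis_combinded/hmm"],
    "EZ_analysis.py": ["hmm_refe_combine"],
    "counter.py": ["cazy_family_EC.csv", "cazysubfamily_EC.csv"],
}


def missing_outputs(root="."):
    """Return {script: [paths that do not exist under root]} for incomplete steps."""
    root = Path(root)
    missing = {}
    for script, paths in EXPECTED_OUTPUTS.items():
        absent = [p for p in paths if not (root / p).exists()]
        if absent:
            missing[script] = absent
    return missing
```

An empty dictionary means every listed output exists; it does not say whether the contents are complete or correct.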
- Linux environment, Python 3
- The following packages need to be available: psutil, collections, scipy, numpy (`collections` is part of the Python standard library)
- The two Python scripts (clustering.py and prediction.py) need to be in the same folder
- The FASTA file must follow this format: `>unique_sequence_ID` followed by a "|"
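The header requirement can be checked before running either script. This is a hedged sketch, not eCAMI code: `check_fasta_headers` is a hypothetical helper, and it interprets the rule as "the text between `>` and the first `|` is the sequence ID, and that ID must be unique within the file".

```python
def check_fasta_headers(lines):
    """Return (line_number, message) pairs for headers that break the expected format."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        # Everything before the first "|" is treated as the sequence ID.
        seq_id, sep, _rest = line[1:].partition("|")
        if not sep:
            problems.append((number, "header has no '|' after the sequence ID"))
        if not seq_id.strip():
            problems.append((number, "empty sequence ID"))
        elif seq_id in seen:
            problems.append((number, f"duplicate sequence ID: {seq_id}"))
        seen.add(seq_id)
    return problems
```

Passing an open file handle works as well as a list of lines; an empty result means no header problems were found under the stated interpretation.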
git clone https://github.com/zhanglabNKU/eCAMI.git
cd eCAMI/
* install required python packages
pip install scipy
pip install argparse
pip install psutil
pip install numpy
Given new protein sequences, search them against a pre-computed k-mer peptide library, whose peptides are associated with known CAZyme families or EC numbers, to assign CAZyme families or EC numbers.
- First, two pre-computed k-mer peptide libraries come with this package (the CAZyme and EC folder) and must exist in your folder
- The prediction.py can be run as follows:
python prediction.py -input <file_name> -kmer_db <kmer_dir_path> -output <prediction_output_name>
For example:
python prediction.py -input examples/prediction/input/test.faa -kmer_db CAZyme -output examples/prediction/output/new-output.txt
The `test.faa` in the `examples/prediction/input` folder will be processed by comparing it to the pre-computed k_mer peptides (the k_mer library) in the `CAZyme` folder. You will get the prediction results, named `new-output.txt`, in the `examples/prediction/output` folder. You can compare it with the `test_pred_cluster_labels.txt` that is already there.
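One way to do that comparison is a line-by-line diff. The sketch below uses the standard `difflib` module; the helper name `diff_outputs` is hypothetical, and it assumes both files are plain text with records in the same order, which may not hold if the run settings differ.

```python
import difflib
from pathlib import Path


def diff_outputs(new_path, reference_path):
    """Return unified-diff lines between a reference output and a new output.

    An empty list means the two files have identical lines.
    """
    new_lines = Path(new_path).read_text().splitlines()
    ref_lines = Path(reference_path).read_text().splitlines()
    return list(
        difflib.unified_diff(
            ref_lines,
            new_lines,
            fromfile=str(reference_path),
            tofile=str(new_path),
            lineterm="",
        )
    )
```

For the example above, the arguments would be `examples/prediction/output/new-output.txt` and `examples/prediction/output/test_pred_cluster_labels.txt`.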
- In addition, other options for prediction are also available:
  - `-k_mer`: the length of the k_mer peptide to be extracted from the query for comparison against the k_mer peptide library; the length must be the same as the length of the k_mers in the k_mer peptide library (default=8)
  - `-jobs`: the number of CPU processors to use (default=8)
  - `-beta`: the minimum sum of k_mer frequency for assigning the query to an existing protein family of the k_mer peptide library (default=1)
  - `-important_k_mer_number`: the minimum number of the same k_mers shared between the query and a family in the k_mer peptide library (default=3)
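To make the roles of `-k_mer` and `-important_k_mer_number` concrete, here is a toy sketch of k-mer matching. It is not eCAMI's implementation: the function names are hypothetical, 1-based start positions are an assumption, and the frequency-based `-beta` scoring is left out entirely.

```python
def extract_kmers(sequence, k=8):
    """Return (k-mer, 1-based start position) pairs for every window of length k."""
    return [(sequence[i:i + k], i + 1) for i in range(len(sequence) - k + 1)]


def shared_kmers(sequence, library_kmers, k=8):
    """Return the query k-mers, with positions, that also occur in the library."""
    library = set(library_kmers)
    return [(kmer, pos) for kmer, pos in extract_kmers(sequence, k) if kmer in library]


def passes_kmer_count(sequence, library_kmers, k=8, important_k_mer_number=3):
    """True if the query shares at least important_k_mer_number distinct k-mers."""
    distinct = {kmer for kmer, _ in shared_kmers(sequence, library_kmers, k)}
    return len(distinct) >= important_k_mer_number
```

The sketch also shows why the query k-mer length has to equal the library's: a window of a different length can never be a member of the library set.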
- In most cases, only the `-k_mer` option needs to be changed.
- To print all options, you can type:
python prediction.py -help
- Output files: please see the readme.txt file in the examples/prediction folder for an explanation of the output files and format.
- Briefly, the output.txt will have the assignment of the query to the existing protein family of the k_mer peptide library, e.g.:
protein_name fam_name:group_number subfam_name_of_the_group:subfam_name_count
matching kmers(start position in the query sequence)
For example:
>AIJ19564.1|GH5_4|3.2.1.4 GH5:102 GH5_4:15|3.2.1.4:3
SVRIPVTW(82),RIPVTWMG(84),PVTWMGHI(86),VTWMGHIG(87),IINIHHDG(124),VLNEWNQV(210),NRLMVAVH(259),
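A record in this two-line layout can be split into fields with a few string operations. The parser below is a sketch based only on the example shown here: `parse_prediction_record` is a hypothetical name, and it assumes the header fields are separated by whitespace (the real file may use tabs) and that protein names contain no whitespace.

```python
import re

# Matches entries such as "SVRIPVTW(82)" in the k-mer line.
KMER_PATTERN = re.compile(r"([^,()]+)\((\d+)\)")


def parse_prediction_record(header_line, kmer_line):
    """Parse one prediction record (header line plus matching k-mer line)."""
    fields = header_line.split()
    protein_name = fields[0].lstrip(">")
    # rsplit keeps any ":" inside the family name intact.
    fam_name, group_number = fields[1].rsplit(":", 1)
    subfams = {}
    for item in fields[2].split("|"):
        name, count = item.rsplit(":", 1)
        subfams[name] = int(count)
    kmers = [(kmer.strip(), int(pos)) for kmer, pos in KMER_PATTERN.findall(kmer_line)]
    return {
        "protein_name": protein_name,
        "family": fam_name,
        "group_number": int(group_number),
        "subfamily_counts": subfams,
        "kmers": kmers,
    }
```

Applied to the example record, this gives family `GH5`, group number 102, and subfamily counts for `GH5_4` and `3.2.1.4`.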
Given a FASTA file of a protein family (e.g., a CAZyme family, an EC class at the 3rd level, or any other protein family), classify the sequences into clusters and extract the distinguishing k-mer peptides of each cluster (i.e., build your own library of k-mer peptides for future use).
- The clustering.py can be run as follows:
python clustering.py -input <file_name> -output_dir <output_dir_name>
For example:
python clustering.py -input examples/clustering/input/GH5.faa -output_dir examples/clustering/output/new-GH5
python clustering.py -input examples/clustering/input/3.2.1.-.faa -output_dir examples/clustering/output/new-EC3.2.1.-
The `GH5.faa` in the `examples/clustering/input` folder will be processed. You will get the clustering results (FASTA sequences and distinguishing k-mer peptides of each cluster) in the `examples/clustering/output/new-GH5` folder.
The `3.2.1.-.faa` in the `examples/clustering/input` folder will be processed. You will get the clustering results (FASTA sequences and distinguishing k-mer peptides of each cluster) in the `examples/clustering/output/new-EC3.2.1.-` folder.
You can compare them with the `GH5` and the `EC3.2.1.-` folders that are already there.
- In addition, other options for clustering are also available:
  - `-k_mer`: the length of the k-mer peptide for clustering (default=8)
  - `-minimum_group_size`: the minimum number of proteins in a final cluster (default=5)
  - `-piece_number`: the number of pieces into which the sorted peptides are split (default=8)
  - `-jobs`: the number of CPU processors to use (default=1)
  - `-important_k_mer_number`: the minimum number of k_mers for making groups/clusters in the first step (default=10)
  - `-alpha`: the minimum sum of k_mer frequency for combining groups (default=7)
  - `-beta`: the minimum sum of k_mer frequency for adding singletons into existing groups (default=1)
  - `-del_percent`: the minimum ratio of a k_mer's frequency in a cluster to its frequency in the whole protein set for the k_mer to be considered for clustering (default=0.2)
  - `-selected_k_mer_number_cut`: the minimum number of proteins in which a k_mer is present, used to remove k_mers before clustering (default=2)
- In most cases, only the `-k_mer`, `-minimum_group_size` and `-jobs` options need to be changed.
- To print all options, you can type:
python clustering.py -help
- Output files: please see the readme.txt file in the examples/clustering folder for explanation of the output files and format.
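When several families need clustering with the same settings, the command line can be assembled programmatically. This is a minimal sketch: `clustering_command` and `run_clustering` are hypothetical helpers, the flags and defaults are the ones documented above, and the code assumes it is run from the folder that contains `clustering.py`.

```python
import subprocess
import sys


def clustering_command(input_faa, output_dir, k_mer=8, minimum_group_size=5, jobs=1):
    """Build the argument list for one clustering.py run using the documented flags."""
    return [
        sys.executable, "clustering.py",
        "-input", str(input_faa),
        "-output_dir", str(output_dir),
        "-k_mer", str(k_mer),
        "-minimum_group_size", str(minimum_group_size),
        "-jobs", str(jobs),
    ]


def run_clustering(input_faa, output_dir, **options):
    """Run clustering.py and raise CalledProcessError if it exits with an error."""
    subprocess.run(clustering_command(input_faa, output_dir, **options), check=True)
```

For example, `run_clustering("examples/clustering/input/GH5.faa", "examples/clustering/output/new-GH5", jobs=4)` would repeat the GH5 example with four processors.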