cdd2cog.pl is a script to assign COG categories to query protein sequences.
- Synopsis
- Description
- Usage
- Options
- Output
- Run environment
- Author - contact
- Acknowledgements
- Citation, installation, and license
- Changelog
perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog
For troubleshooting and a working example please see issue #1.
The script assigns COG (Clusters of Orthologous Groups) categories to proteins. For this purpose, the query proteins need to be searched with RPS-BLAST+ (Reverse Position-Specific BLAST) against NCBI's Conserved Domain Database (CDD). Use cds_extractor.pl beforehand to extract multi-FASTA protein files from GenBank or EMBL files.
Both tab-delimited RPS-BLAST+ output formats, -outfmt 6 and -outfmt 7, can be processed by cdd2cog.pl. By default, the RPS-BLAST+ hits for each query protein are filtered for the best hit (lowest e-value). Use option -a|-all_hits to assign COGs to all BLAST hits and, for example, filter them downstream in a spreadsheet application. Results are written to tab-delimited files in the './results' folder; overall assignment statistics are printed to STDOUT.
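The default best-hit filter described above can be approximated outside the script. The following is a minimal sketch, not the code used by cdd2cog.pl: the function name `best_hits` is hypothetical, it assumes the e-value sits in column 11 (as in the default `-outfmt 6` column order), and it keeps the first hit on ties, which may differ from the script's behavior.

```shell
# best_hits FILE
# Hypothetical helper: for every query (column 1) keep the hit with the
# lowest e-value (column 11 of the default tabular layout).
# Comment lines written by -outfmt 7 start with '#' and are skipped.
best_hits() {
    awk -F'\t' '
        /^#/ || NF < 11 { next }
        !($1 in best) || ($11 + 0) < (evalue[$1] + 0) {
            if (!($1 in best)) order[++n] = $1   # remember first-seen order
            best[$1] = $0
            evalue[$1] = $11
        }
        END { for (i = 1; i <= n; i++) print best[order[i]] }
    ' "$1"
}
```

A call such as `best_hits rps-blast.out > best.tsv` would then give one line per query protein, in the order the queries first appear in the report.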
Several files from NCBI's FTP server are needed to run RPS-BLAST+ and cdd2cog.pl:
- CDD (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/): More information about the files in the CDD FTP archive can be found in the respective 'README' file.
    - 'cddid.tbl.gz': Contains summary information about the CD models in a tab-delimited format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short name, CD description, and PSSM (position-specific scoring matrix) length. The file needs to be unpacked: `gunzip cddid.tbl.gz`
    - './little_endian/Cog_LE.tar.gz': Preformatted RPS-BLAST+ database of the CDD COG distribution for Intel CPUs and Unix/Windows architectures. Unpack and untar via: `tar xvfz Cog_LE.tar.gz`
- COG (ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/): Read 'readme' for more information about the respective files in the COG FTP archive.
    - 'fun.txt': One-letter functional classification used in the COG database.
    - 'whog': Name, description, and corresponding functional classification of each COG.
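To illustrate how 'cddid.tbl' links an RPS-BLAST+ hit to a COG, the sketch below appends the CD accession to each tabular hit line. It is an illustration under assumptions, not the logic of cdd2cog.pl: the function name `cog_for_hits` is hypothetical, and the subject ID (column 2) is assumed to end in the numeric PSSM-Id, for example `CDD:223080` or `gnl|CDD|223080`.

```shell
# cog_for_hits CDDID_TBL HITS
# Hypothetical helper: look up the PSSM-Id of each hit in cddid.tbl
# (column 1 = PSSM-Id, column 2 = CD accession) and append the accession
# as an extra column; "NA" is appended when the PSSM-Id is not listed.
cog_for_hits() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR { acc[$1] = $2; next }   # first file: PSSM-Id -> accession
        /^#/ { next }                      # skip -outfmt 7 comment lines
        {
            id = $2
            sub(/^.*[^0-9]/, "", id)       # assumed: ID ends in the PSSM-Id
            print $0, ((id in acc) ? acc[id] : "NA")
        }
    ' "$1" "$2"
}
```

For example, `cog_for_hits cddid.tbl rps-blast.out` prints every hit followed by its accession, so hits without a listed PSSM-Id are easy to spot by the `NA` marker.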
rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6
rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'
perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog -a
- -r, -rps_report: Path to RPS-BLAST+ report/output, outfmt 6 or 7
- -c, -cddid: Path to CDD's 'cddid.tbl' file
- -f, -fun: Path to COG's 'fun.txt' file
- -w, -whog: Path to COG's 'whog' file
- -h, -help: Help (perldoc POD)
- -a, -all_hits: Don't filter RPS-BLAST+ output for the best hit, rather assign COGs to all hits
- -v, -version: Print version number to STDERR
- STDOUT: Overall assignment statistics
- ./results: All tab-delimited output files are stored in this result folder
- rps-blast_cog.txt: COG assignments concatenated to the RPS-BLAST+ results for filtering
- protein-id_cog.txt: Slimmed-down 'rps-blast_cog.txt' that only includes the query ID (first BLAST report column), COGs, and functional categories
- cog_stats.txt: Assignment counts for each used COG
- func_stats.txt: Assignment counts for single-letter functional categories
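As a rough cross-check of the functional category counts, the letters can be tallied directly from a tab-delimited file. This is only a sketch under assumptions: the function name `count_categories` is hypothetical, the file is assumed to have no header line, the category letters are assumed to sit in column 3 unless another column is given, and a COG with several letters (e.g. `KT`) is counted once per letter, which may not match how cdd2cog.pl counts.

```shell
# count_categories FILE [COLUMN]
# Hypothetical helper: count single-letter functional categories found in
# the given column (default 3, an assumption about the file layout).
# Multi-letter entries are split and each letter is counted separately.
count_categories() {
    awk -F'\t' -v col="${2:-3}" '
        /^#/ || NF < col { next }
        {
            n = length($col)
            for (i = 1; i <= n; i++) {
                c = substr($col, i, 1)
                if (c ~ /[A-Z]/) count[c]++
            }
        }
        END { for (c in count) printf "%s\t%d\n", c, count[c] }
    ' "$1" | sort
}
```

Running `count_categories results/protein-id_cog.txt` would print one line per category letter with its count, sorted alphabetically.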
The Perl script runs on UNIX-like operating systems.
Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's IMG/ER annotation system, which employs the same technique.
For citation, installation, and license information please see the repository main README.md.
- v0.2 (2017-02-16)
- Adapted to new NCBI FASTA header format for CDD RPS-BLAST+ output
- v0.1 (2013-08-01)