cdd2cog

cdd2cog.pl is a script to assign COG categories to query protein sequences.

Synopsis

perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog

Description

For troubleshooting and a working example please see issue #1.

The script assigns COG (Clusters of Orthologous Groups) categories to proteins. For this purpose, the query proteins need to be searched with RPS-BLAST+ (Reverse Position-Specific BLAST) against NCBI's Conserved Domain Database (CDD). Use cds_extractor.pl beforehand to extract multi-FASTA protein files from GenBank or EMBL files.

Both tab-delimited RPS-BLAST+ output formats, -outfmt 6 and -outfmt 7, can be processed by cdd2cog.pl. By default, the RPS-BLAST+ hits for each query protein are filtered for the best hit (lowest e-value). Use option -a|-all_hits to assign COGs to all BLAST hits instead, e.g. for downstream filtering in a spreadsheet application. Results are written to tab-delimited files in the './results' folder; overall assignment statistics are printed to STDOUT.
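The default best-hit filter can be pictured with a small shell sketch. This is not the code cdd2cog.pl uses internally; `best_hits` is a hypothetical helper that assumes the standard -outfmt 6 column order (query id in column 1, e-value in column 11) and keeps the first hit seen when e-values tie.

```shell
# Hypothetical helper, not part of cdd2cog.pl:
# keep the hit with the lowest e-value for each query protein.
best_hits() {
  awk -F '\t' '
    /^#/ || NF < 12 { next }                  # skip outfmt 7 comment lines and malformed rows
    !($1 in best) || $11 + 0 < evalue[$1] {   # first hit for this query, or a better one
      if (!($1 in best)) order[++n] = $1      # remember first-seen query order
      best[$1] = $0
      evalue[$1] = $11 + 0
    }
    END { for (i = 1; i <= n; i++) print best[order[i]] }
  '
}
```

It reads the report on standard input, for example `best_hits < rps-blast.out`, and prints one line per query in the order the queries first appear.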

Several files are needed from NCBI's FTP server to run RPS-BLAST+ and cdd2cog.pl:

  1. CDD (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/)

    More information about the files in the CDD FTP archive can be found in the respective 'README' file.

    • 'cddid.tbl.gz'

      The file needs to be unpacked:

      `gunzip cddid.tbl.gz`

      Contains summary information about the CD models in a tab-delimited format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short name, CD description, and PSSM (position-specific scoring matrices) length.

    • './little_endian/Cog_LE.tar.gz'

      Unpack and untar via:

      `tar xvfz Cog_LE.tar.gz`

      Preformatted RPS-BLAST+ database of the CDD COG distribution for Intel CPUs and Unix/Windows architectures.

  2. COG (ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/)

    Read 'readme' for more information about the respective files in the COG FTP archive.

    • 'fun.txt'

      One-letter functional classification used in the COG database.

    • 'whog'

      Name, description, and corresponding functional classification of each COG.
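To illustrate how 'cddid.tbl' ties an RPS-BLAST+ hit to a COG, the following sketch appends the CD accession to each hit line. It is only an approximation of the idea, not the script's implementation: `annotate_hits` is a hypothetical helper, and it assumes the subject id in column 2 ends in the numeric PSSM-Id (for example `CDD:<PSSM-Id>` or `gnl|CDD|<PSSM-Id>`).

```shell
# Hypothetical helper, not part of cdd2cog.pl:
# append the CD accession (column 2 of cddid.tbl) to each RPS-BLAST+ hit.
# Usage: annotate_hits cddid.tbl < rps-blast.out
annotate_hits() {
  awk -F '\t' -v OFS='\t' '
    NR == FNR { acc[$1] = $2; next }   # first file: PSSM-Id -> CD accession
    /^#/ { next }                      # skip outfmt 7 comment lines
    {
      id = $2
      sub(/^.*[^0-9]/, "", id)         # assumed shapes: "CDD:<PSSM-Id>" or "gnl|CDD|<PSSM-Id>"
      print $0, ((id in acc) ? acc[id] : "NA")
    }
  ' "$1" -
}
```

Hits whose PSSM-Id is not listed in 'cddid.tbl' are marked `NA` here, which is a choice made for this sketch only.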

Usage

RPS-BLAST+

rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6
rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'

cdd2cog

perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog -a

Options

Mandatory options

  • -r, -rps_report

    Path to RPS-BLAST+ report/output, outfmt 6 or 7

  • -c, -cddid

    Path to CDD's 'cddid.tbl' file

  • -f, -fun

    Path to COG's 'fun.txt' file

  • -w, -whog

    Path to COG's 'whog' file

Optional options

  • -h, -help

    Help (perldoc POD)

  • -a, -all_hits

    Don't filter RPS-BLAST+ output for the best hit, rather assign COGs to all hits

  • -v, -version

    Print version number to STDERR

Output

  • STDOUT

    Overall assignment statistics

  • ./results

    All tab-delimited output files are stored in this result folder

  • rps-blast_cog.txt

    COG assignments concatenated to the RPS-BLAST+ results for filtering

  • protein-id_cog.txt

    Slimmed down 'rps-blast_cog.txt' only including query id (first BLAST report column), COGs, and functional categories

  • cog_stats.txt

    Assignment counts for each used COG

  • func_stats.txt

    Assignment counts for single-letter functional categories
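When the script is run with -a, 'rps-blast_cog.txt' can be filtered on the command line instead of in a spreadsheet. The sketch below is a rough example under stated assumptions: `filter_evalue` is a hypothetical helper, and it assumes the leading columns are the unchanged -outfmt 6 columns, so that the e-value sits in column 11. Lines without a numeric value in that column (such as a possible header line) are dropped.

```shell
# Hypothetical helper, not part of cdd2cog.pl:
# keep rows whose e-value (column 11, assumed) is at or below a cutoff.
# Usage: filter_evalue 1e-10 < results/rps-blast_cog.txt
filter_evalue() {
  awk -F '\t' -v max="$1" '
    $11 ~ /^[0-9.]+([eE][-+]?[0-9]+)?$/ && $11 + 0 <= max + 0   # numeric e-value within the cutoff
  '
}
```

If the column layout of the result file differs from this assumption, adjust the column number accordingly.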

Run environment

The Perl script runs under UNIX flavors.

Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

Acknowledgements

I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's IMG/ER annotation system, which employs the same technique.

Citation, installation, and license

For citation, installation, and license information please see the repository main README.md.

Changelog

  • v0.2 (2017-02-16)
    • Adapted to new NCBI FASTA header format for CDD RPS-BLAST+ output
  • v0.1 (2013-08-01)