A tool for genome annotation using top-down mass spectrometry.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
lib
scripts
spectrogene
.gitignore
LICENSE
README.md
spectrogene.py

README.md

SpectroGene

Version 1.0

Description

SpectroGene is a tool for top-down protein identification using unannotated bacterial genome. SpectroGene could be used for disocvery of previously unannotated genes as well as correction of known genes coordinates (especially, Start codons). Almost all PrSMs which are found with a standard search with a known proteome could be also identified with SpectroGene without proteome reference.

Availability

Currently only Linux x86-64 systems are supported (because of TopPIC dependency).

Requirements

Install

SpectroGene is written on Python and requires no installation. However, it depends on some extra Python modules listed in "Requirements" section. These dependencies could be installed via your system's package manager or "pip" (e.g. "pip install biopython").

Quick Usage

usage: spectrogene.py [-h] [-o OUTPUT_DIR] [-p NUM_PROC] [-e E_VALUE]
                      [--version]
                      genome_file spectra_file

SpectroGene v1.0: Genome annotation using top-down mass spectra

positional arguments:
  genome_file           path to genome file in FASTA fromat
  spectra_file          path to spectra file in MsAlign format

optional arguments:
   -h, --help            show this help message and exit
   -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        output directory (default: spectrogene-out)
  -p NUM_PROC, --num_proc NUM_PROC
                        number of processes (default: 1)
  -e E_VALUE, --evalue E_VALUE
                        E-value threshold for PrSM (default: 0.01)
  --version             show program's version number and exit

Running Instructions

SpectroGene takes as input:

  • Genome file in FASTA format
  • Deconvoluted top-down spectra in msalign format (similar to MGF)

A typical pipeline for raw spectra (such as from Thermo Scientific™ Orbitrap™ mass spectrometers) would include centroiding and deconvolution.

First, raw spectra should be centroided and converted to mzXML format, which could be done using msconvert utility [available for Windows only: http://proteowizard.sourceforge.net/tools.shtml]. Here is an example command line:

msconvert.exe --filter "peakPicking true 1-" --mzXML --zlib spectra.raw

Then, the resulting mzXML file should be deconvoluted using MsDeconv software [http://bix.ucsd.edu/projects/msdeconv/]:

java -jar msdeconv.jar spectra.mzXML -o msalign

Finally, you can run SpectroGene on resulting .msalign file:

./spectrogene.py genome.fasta spectra.msalign

Output

SpectroGene provides three kinds of output:

  • PrSMs mapped on the genome sequence
  • HTML visualisation of PrSMs (done by TopPIC)
  • ORF clusters of identified proteoforms

PrSMs are output into a file called "prsms.txt" in the output directory. It is formatted as CSV table and have self-explanatory columns. A graphical representations of all identified spectra in HTML fromat could be found in "html_prsms" directory.

Also, raw TopPIC output with additional spectra information is available in "toppic_merged.txt" file.

The information about identified ORF cluster is output into "orf_clusters.txt" file. For each cluster multiple statistics are provided (such as positions, number of spectra/proteoforms etc.). Below the statistics, proteoforms aligned on the ORF sequence are shown. Putative Start codons are marked with "v" symbol on the top of the ORF sequence. Each proteoform might have multiple corresponding spectra. In that case, we choose the spectra with the lowest P-value as representative. Representative spectra ids and P-values are shown on the right side of the proteoforms. Also, post-translational modifications are shown above the corresponding proteoforms (with numbers corresponding to a modification weight).

Useful scripts

SpectrGene distribution also contain some additional scripts which might be useful (located in "scripts" directory):

merge_msalign.py

Merges multiple msalign spectra files into one.

statistics.py

Provides some extra statistics on the SpectroGene run.

compare.py

Compares two SpectrGene runs (for benchmarking purposes)

proteome_run.py

Runs SpectroGene with a known proteome (also for benchmarking, see the paper for details).

License

The program is distributed under BSD license. See LICENSE file for details.

Citation

  • Mikhail Kolmogorov, Xiaowen Liu and Pavel Pevzner. "SpectroGene: a tool for proteogenomic annotations using top-down spectra"

Authors

  • Mikhail Kolmogorov (UCSD)

Contacts

Please report any bugs directly to the issue tracker of this project. Also, you can send your feedback at fenderglass@gmail.com