in house grouping program for proteomics
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
gpgrouper
scripts
.gitignore
LICENSE
README.org
__init__.py
database_config.py
requirements.txt
setup.py

README.org

Requirements

Input File

The input file should be a resulting PSM (peptide-spectrum match) output file from a database search such as Mascot or Andromeda in tab delimited format. The minimum required columns are :

Sequence
The amino acid sequence of a PSM
Modifications
A column that describes the modifications of the sequence.

(Note for MaxQuant this information is already provided in Modified Sequence)

PrecursorArea
(or Intensity) the AUC (or SPCs) for each PSM
IonScore
Mascot (or search engine equivalent) score for each PSM
q_value
Percolator (or search engine equivalent) score for each PSM Note that PEP can be used in place of q_value with the rough observational approximation that PEP / 10 = q_value
  • Charge

Renaming Columns

A base configuration file for renaming column headers is provided with gpgrouper To generate it, simply run gpgrouper getconfig and a config file will be generated. The configuration file can be specified by the --configfile flag in gpgrouper run, but if not specified and the default gpgrouper_config.ini file exists, it will be used.

This file can be edited with addition of new column aliases as needed, though it should be set up to work with ProteomeDiscoverer+Mascot and MaxQuant already.

Additional info for MaxQuant

For MaxQuant output, the evidence.txt PSMs file should be used as input. gpgrouper is designed to be run separately for each experiment. As of writing, it is not configured to work with multiple label-free experiments searched together under say MaxQuant. Therefore, separate experiments need to be separated. In other words, if MaxQuant is used to search multiple experiments at once (potentially with match between runs) the evidence.txt file needs to be separated into separate “experiment” files. A simple script is available for doing just this.

FASTA Database File

The database file should be a pre-constructed tab delimited file for matching PSMs to their respective GeneIDs. The required columns are TaxonID, HomologeneID, GeneID, ProteinGI, FASTA. Note that PyGrouper uses GeneID to group PSMs, so if a GeneID is lacking for a desired grouping another identifier can be substituted in such as ProteinGI. HomologeneID can be an empty column if this information is not available or desired.