197 changes: 150 additions & 47 deletions README.md
@@ -1,5 +1,6 @@
Overview
======
# MS-GF+

[![CI](https://github.com/bigbio/msgfplus/actions/workflows/ci.yml/badge.svg)](https://github.com/bigbio/msgfplus/actions/workflows/ci.yml)

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring
MS/MS spectra against peptides derived from a protein sequence database.
@@ -13,68 +14,170 @@ It supports a variety of input file formats, including mzML, mzXML,
Mascot Generic File (mgf), MS2 files, Micromass Peak List files (pkl),
and Concatenated DTA files (_dta.txt).

Requirements
======
## Requirements

- Java Runtime 17 or higher (use 64-bit Java)
- At least 2 GB of memory (4 GB+ recommended); larger FASTA files require more memory

## Installation

```bash
# Download the latest release JAR
# Place MSGFPlus.jar in any folder
```

## Quick Start

```bash
# Basic search
java -Xmx4G -jar MSGFPlus.jar \
-s spectra.mzML \
-d database.fasta \
-o results.mzid

# TMT search with target-decoy analysis
java -Xmx8G -jar MSGFPlus.jar \
-s spectra.mzML \
-d database.fasta \
-tda 1 \
-t 20ppm \
-ti -1,2 \
-inst 1 \
-e 1 \
-protocol 4 \
-mod mods.txt \
-o results.mzid

# Convert mzid output to TSV
java -cp MSGFPlus.jar edu.ucsd.msjava.ui.MzIDToTsv \
-i results.mzid \
-o results.tsv
```

## Parameters

### Required

| Flag | Name | Description |
|------|------|-------------|
| `-s` | SpectrumFile | Input spectrum file (`*.mzML`, `*.mzXML`, `*.mgf`, `*.ms2`, `*.pkl`, `*_dta.txt`). Spectra should be centroided. |
| `-d` | DatabaseFile | Protein sequence database (`*.fasta`, `*.fa`, `*.faa`). |

### Core Search Parameters

| Flag | Name | Default | Description |
|------|------|---------|-------------|
| `-o` | OutputFile | `[input].mzid` | Output file path (`.mzid` format). |
| `-conf` | ConfigurationFile | — | Configuration file; command-line options override config file settings. |
| `-t` | PrecursorMassTolerance | `20ppm` | Precursor mass tolerance (e.g., `2.5Da`, `20ppm`, or `0.5Da,2.5Da` for asymmetric). |
| `-ti` | IsotopeErrorRange | `0,1` | Range of allowed isotope peak errors (e.g., `-1,2`). |
| `-tda` | TDA | `0` | Target-decoy analysis: `0` = don't search decoy database, `1` = search decoy database. |
| `-decoy` | DecoyPrefix | `XXX` | Prefix for decoy protein names in the FASTA file. |
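
To get a feel for what a relative tolerance such as `-t 20ppm` means in absolute terms, the sketch below converts ppm to Da for a given precursor mass. The helper name `ppm_to_da` and the example masses are illustrative assumptions and are not part of MS-GF+.

```bash
#!/usr/bin/env bash
# ppm_to_da MASS_DA PPM
# Hypothetical helper: prints the absolute tolerance (in Da) that a
# relative tolerance of PPM parts per million represents at MASS_DA.
ppm_to_da() {
  local mass_da="$1" ppm="$2"
  awk -v m="$mass_da" -v p="$ppm" 'BEGIN { printf "%.4f\n", m * p / 1e6 }'
}

ppm_to_da 1500 20   # prints 0.0300
ppm_to_da 3000 20   # prints 0.0600
```

The same `20ppm` setting therefore allows a wider absolute window for heavier precursors, which is one reason a ppm tolerance is usually preferred over a fixed Da tolerance on high-resolution instruments.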

### Fragmentation and Instrument

| Flag | Name | Default | Description |
|------|------|---------|-------------|
| `-m` | FragmentationMethodID | `0` | `0` = As written in spectrum or CID if no info, `1` = CID, `2` = ETD, `3` = HCD, `4` = UVPD. |
| `-inst` | InstrumentID | `0` | `0` = Low-res LCQ/LTQ, `1` = Orbitrap/FTICR/Lumos (default for HCD), `2` = TOF, `3` = Q-Exactive. |

### Enzyme and Digestion

| Flag | Name | Default | Description |
|------|------|---------|-------------|
| `-e` | EnzymeID | `1` | `0` = Unspecific, `1` = Trypsin, `2` = Chymotrypsin, `3` = Lys-C, `4` = Lys-N, `5` = Glu-C, `6` = Arg-C, `7` = Asp-N, `8` = alphaLP, `9` = No cleavage, `10` = TrypPlusC. |
| `-ntt` | NTT | `2` | Number of tolerable termini: `0` = non-specific, `1` = semi-specific, `2` = fully specific. |
| `-maxMissedCleavages` | MaxMissedCleavages | `-1` | Maximum missed cleavages (`-1` = no limit). |

### Peptide Filtering

| Flag | Name | Default | Description |
|------|------|---------|-------------|
| `-minLength` | MinPepLength | `6` | Minimum peptide length to consider. |
| `-maxLength` | MaxPepLength | `40` | Maximum peptide length to consider. |
| `-minCharge` | MinCharge | `2` | Minimum precursor charge (if not in spectrum file). |
| `-maxCharge` | MaxCharge | `3` | Maximum precursor charge (if not in spectrum file). |
| `-msLevel` | MSLevel | `2` | MS level(s) to search (e.g., `2` or `2,3` for MS2+MS3). |

### Modifications and Protocol

| Flag | Name | Default | Description |
|------|------|---------|-------------|
| `-mod` | ModificationFileName | — | Modification file path. If not specified, uses standard amino acids with fixed Carbamidomethyl C. |
| `-numMods` | NumMods | `3` | Maximum number of dynamic (variable) modifications per peptide. |
| `-protocol` | ProtocolID | `0` | `0` = Automatic, `1` = Phosphorylation, `2` = iTRAQ, `3` = iTRAQPhospho, `4` = TMT, `5` = Standard. |

### Output and Performance

Java Runtime 8 or higher (use 64-bit Java)\
At least 2GB of memory (recommended to use 4GB); larger FASTA files require more memory
| Flag | Name | Default | Description |
|------|------|---------|-------------|
| `-n` | NumMatchesPerSpec | `1` | Number of matches per spectrum to report. Values >1 may skew FDR. |
| `-addFeatures` | AddFeatures | `0` | `0` = basic scores, `1` = additional features (enable for Percolator). |
| `-thread` | NumThreads | All cores | Number of concurrent threads. |
| `-tasks` | NumTasks | `0` | Override task count: `0` = auto, `>0` = exact count, `<0` = absolute value is a multiplier of the thread count (e.g., `-2` = 2 × threads). |
| `-verbose` | Verbose | `0` | `0` = total progress only, `1` = per-thread progress. |
| `-ccm` | ChargeCarrierMass | `1.00727649` | Mass of charge carrier (proton). |
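
As a rough illustration of how `-thread` and `-tasks` combine, the sketch below computes the task count implied by the `-tasks` row above. The helper `effective_tasks` is hypothetical and only mirrors the table's description; the value used for `0` (auto) is decided internally by MS-GF+ and is not modelled here.

```bash
#!/usr/bin/env bash
# effective_tasks THREADS TASKS
# Hypothetical helper mirroring the -tasks description:
#   TASKS > 0  -> exactly TASKS tasks
#   TASKS < 0  -> THREADS * |TASKS| tasks
#   TASKS = 0  -> chosen automatically by MS-GF+ (printed as "auto")
effective_tasks() {
  local threads="$1" tasks="$2"
  if [ "$tasks" -gt 0 ]; then
    echo "$tasks"
  elif [ "$tasks" -lt 0 ]; then
    echo $(( threads * (0 - tasks) ))
  else
    echo "auto"
  fi
}

effective_tasks 4 -2   # prints 8
effective_tasks 4 6    # prints 6
effective_tasks 4 0    # prints auto

# Dry run: show the corresponding command without starting a search.
echo "java -Xmx8G -jar MSGFPlus.jar -s spectra.mzML -d database.fasta -thread 4 -tasks -2"
```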

Releases after April 2019 support Java 11 and newer; older releases will not work with Java 11 or newer (at least for reading mzML files) due to the deprecation and removal of some built-in libraries.
### Advanced Parameters

Downloads / Updates
======
[![Github This Release](https://img.shields.io/github/downloads/MSGFPlus/msgfplus/total.svg)]() [![Github This Release](https://img.shields.io/github/downloads/MSGFPlus/msgfplus/latest/total.svg)]()
* https://github.com/MSGFPlus/msgfplus/releases
| Flag | Name | Default | Description |
|------|------|---------|-------------|
| `-minNumPeaks` | MinNumPeaksPerSpectrum | `10` | Minimum number of peaks per spectrum. |
| `-iso` | NumIsoforms | `128` | Number of isoforms to consider per peptide. |
| `-ignoreMetCleavage` | IgnoreMetCleavage | `0` | `0` = consider N-term Met cleavage, `1` = ignore. |
| `-allowDenseCentroidedPeaks` | AllowDenseCentroidedPeaks | `0` | `0` = skip spectra failing density check, `1` = allow dense centroided spectra. |

*Version number notes*
## Modification File Format

Modifications are specified in a text file passed via `-mod`. Each line defines a static or dynamic modification:

As of [January 20, 2016 (commit 375d462)](https://github.com/MSGFPlus/msgfplus/commit/375d462e30cbe460b699091a7d6ba52bc192aba1) the version numbering scheme changed.
Previously the version number was the SVN commit number; git does not have simple commit numbers, so MSGFPlus was changed to a date-based version numbering scheme.
```
# Format: Mass_or_Composition, Residues, ModType, Position, Name

An example: v10282 became v2016.01.20
# Static modifications
StaticMod=C2H3N1O1, C, fix, any, Carbamidomethyl # Fixed alkylation
StaticMod=229.1629, *, fix, N-term, TMT6plex
StaticMod=229.1629, K, fix, any, TMT6plex

Installation
======
# Dynamic modifications
DynamicMod=O1, M, opt, any, Oxidation # Oxidized methionine
DynamicMod=HO3P, STY, opt, any, Phospho # Phosphorylation
DynamicMod=H-1N-1O1, NQ, opt, any, Deamidated # Deamidation

Unzip MSGFPlus.zip\
Place MSGFPlus.jar in any folder
# Position options: any, N-term, C-term, Prot-N-term, Prot-C-term
```

Usage Information
======
See [`docs/examples/MSGFPlus_Params.txt`](docs/examples/MSGFPlus_Params.txt) for a complete example configuration file.
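
A minimal sketch of putting the format above to use: it writes a small modification file for a TMT search and prints (rather than runs) a matching command. The temporary directory and the choice of modifications are assumptions for illustration; the entries themselves are copied from the format example above.

```bash
#!/usr/bin/env bash
# Write an example modification file into a scratch directory.
work_dir="$(mktemp -d)"
mods_file="$work_dir/mods.txt"

cat > "$mods_file" <<'EOF'
# Static modifications
StaticMod=C2H3N1O1, C, fix, any, Carbamidomethyl
StaticMod=229.1629, *, fix, N-term, TMT6plex
StaticMod=229.1629, K, fix, any, TMT6plex

# Dynamic modifications
DynamicMod=O1, M, opt, any, Oxidation
EOF

# Quick sanity check: count the entries of each kind.
static_count=$(grep -c '^StaticMod=' "$mods_file")
dynamic_count=$(grep -c '^DynamicMod=' "$mods_file")
echo "static: $static_count, dynamic: $dynamic_count"

# Dry run: print the search command that would use this file.
echo "java -Xmx8G -jar MSGFPlus.jar -s spectra.mzML -d database.fasta -protocol 4 -mod $mods_file -o results.mzid"
```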

Type `java -jar MSGFPlus.jar` for command line arguments.
## Configuration File

To convert an mzid output file into a tsv file, run `java -cp MSGFPlus.jar edu.ucsd.msjava.ui.MzIDToTsv`
All parameters can also be specified in a configuration file (passed via `-conf`). Command-line options override configuration file settings. Use the parameter's config name (e.g., `PrecursorMassTolerance=20ppm`) instead of the flag form.
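
The following sketch shows one way a configuration file might look. It assumes that the values in the "Name" column of the parameter tables are the configuration keys, as the `PrecursorMassTolerance=20ppm` example suggests; the file name `msgfplus.conf` is hypothetical.

```bash
#!/usr/bin/env bash
# Write an example configuration file into a scratch directory.
conf_dir="$(mktemp -d)"
conf_file="$conf_dir/msgfplus.conf"   # hypothetical file name

cat > "$conf_file" <<'EOF'
PrecursorMassTolerance=20ppm
IsotopeErrorRange=-1,2
TDA=1
InstrumentID=1
EnzymeID=1
ProtocolID=4
EOF

# Dry run: print the command instead of running it.
# Because command-line options override the configuration file,
# -t 10ppm here would take precedence over PrecursorMassTolerance=20ppm.
echo "java -Xmx8G -jar MSGFPlus.jar -s spectra.mzML -d database.fasta -conf $conf_file -t 10ppm -o results.mzid"
```

Keeping shared settings in the file and passing only per-run overrides on the command line makes it easier to reproduce a search later.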

Alternatively, use the Mzid-To-Tsv-Converter, which is a faster converter that supports larger result files.
It is a C# application that works on Windows or on Linux using mono.
Download the Mzid-To-Tsv-Converter <a href="https://github.com/PNNL-Comp-Mass-Spec/Mzid-To-Tsv-Converter/releases">from GitHub</a>.
## Building from Source

For detailed documentation, see the "docs" subfolder, or visit:
* [GitHub project help pages](https://msgfplus.github.io/msgfplus/)
* [GitHub repo HTML help pages - same as above, but may have issues](https://htmlpreview.github.io/?https://github.com/MSGFPlus/msgfplus/blob/master/docs/index.html)
* (previously at https://bix-lab.ucsd.edu/pages/viewpage.action?pageId=13533355)
```bash
# Requires Java 17+ and Maven
mvn package

Contact Information
======
# Run tests
mvn test

PNNL Proteomics [proteomics@pnnl.gov]\
Sangtae Kim [sangtae.kim (at) gmail.com]
# The JAR is produced at target/MSGFPlus.jar
```

Publications
======
## Publications

"MS-GF+ makes progress towards a universal database search tool for proteomics,"\
Sangtae Kim and Pavel A Pevzner,
Nat Commun. 2014 Oct 31; 5:5277. doi: 10.1038/ncomms6277.\
https://pubmed.ncbi.nlm.nih.gov/25358478/
Kim S. and Pevzner P.A.,
"MS-GF+ makes progress towards a universal database search tool for proteomics,"
*Nat Commun.* 2014 Oct 31; 5:5277.
[doi: 10.1038/ncomms6277](https://doi.org/10.1038/ncomms6277)

"Spectral Probabilities and Generating Functions of Tandem Mass Spectra: A Strike against Decoy Databases",\
Sangtae Kim, Nitin Gupta, and Pavel A Pevzner,
J Proteome Res. 2008 Aug; 7(8):3354-63. doi: 10.1021/pr8001244.\
https://pubmed.ncbi.nlm.nih.gov/18597511/
Kim S., Gupta N., and Pevzner P.A.,
"Spectral Probabilities and Generating Functions of Tandem Mass Spectra: A Strike against Decoy Databases,"
*J Proteome Res.* 2008 Aug; 7(8):3354-63.
[doi: 10.1021/pr8001244](https://doi.org/10.1021/pr8001244)

Source
======
## Contact

https://github.com/MSGFPlus/msgfplus
PNNL Proteomics: proteomics@pnnl.gov
Sangtae Kim: sangtae.kim (at) gmail.com