diff --git a/README.md b/README.md
index b0aeecdb..de4a1abf 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,6 @@
-Overview
-======
+# MS-GF+
+
+[![CI](https://github.com/bigbio/msgfplus/actions/workflows/ci.yml/badge.svg)](https://github.com/bigbio/msgfplus/actions/workflows/ci.yml)
 
 MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.
@@ -13,68 +14,170 @@
 It supports a variety of input file formats, including mzML, mzXML, Mascot Generic File (mgf), MS2 files, Micromass Peak List files (pkl), and Concatenated DTA files (_dta.txt).
 
-Requirements
-======
+## Requirements
+
+- Java Runtime 17 or higher (use 64-bit Java)
+- At least 2 GB of memory (4 GB+ recommended); larger FASTA files require more memory
+
+## Installation
+
+```bash
+# Download the latest release JAR
+# Place MSGFPlus.jar in any folder
+```
+
+## Quick Start
+
+```bash
+# Basic search
+java -Xmx4G -jar MSGFPlus.jar \
+  -s spectra.mzML \
+  -d database.fasta \
+  -o results.mzid
+
+# TMT search with target-decoy analysis
+java -Xmx8G -jar MSGFPlus.jar \
+  -s spectra.mzML \
+  -d database.fasta \
+  -tda 1 \
+  -t 20ppm \
+  -ti -1,2 \
+  -inst 1 \
+  -e 1 \
+  -protocol 4 \
+  -mod mods.txt \
+  -o results.mzid
+
+# Convert mzid output to TSV
+java -cp MSGFPlus.jar edu.ucsd.msjava.ui.MzIDToTsv \
+  -i results.mzid \
+  -o results.tsv
+```
+
+## Parameters
+
+### Required
+
+| Flag | Name | Description |
+|------|------|-------------|
+| `-s` | SpectrumFile | Input spectrum file (`*.mzML`, `*.mzXML`, `*.mgf`, `*.ms2`, `*.pkl`, `*_dta.txt`). Spectra should be centroided. |
+| `-d` | DatabaseFile | Protein sequence database (`*.fasta`, `*.fa`, `*.faa`). |
+
+### Core Search Parameters
+
+| Flag | Name | Default | Description |
+|------|------|---------|-------------|
+| `-o` | OutputFile | `[input].mzid` | Output file path (`.mzid` format). |
+| `-conf` | ConfigurationFile | — | Configuration file; command-line options override config file settings. |
+| `-t` | PrecursorMassTolerance | `20ppm` | Precursor mass tolerance (e.g., `2.5Da`, `20ppm`, or `0.5Da,2.5Da` for asymmetric). |
+| `-ti` | IsotopeErrorRange | `0,1` | Range of allowed isotope peak errors (e.g., `-1,2`). |
+| `-tda` | TDA | `0` | Target-decoy analysis: `0` = don't search decoy database, `1` = search decoy database. |
+| `-decoy` | DecoyPrefix | `XXX` | Prefix for decoy protein names in the FASTA file. |
+
+### Fragmentation and Instrument
+
+| Flag | Name | Default | Description |
+|------|------|---------|-------------|
+| `-m` | FragmentationMethodID | `0` | `0` = As written in spectrum or CID if no info, `1` = CID, `2` = ETD, `3` = HCD, `4` = UVPD. |
+| `-inst` | InstrumentID | `0` | `0` = Low-res LCQ/LTQ, `1` = Orbitrap/FTICR/Lumos (default for HCD), `2` = TOF, `3` = Q-Exactive. |
+
+### Enzyme and Digestion
+
+| Flag | Name | Default | Description |
+|------|------|---------|-------------|
+| `-e` | EnzymeID | `1` | `0` = Unspecific, `1` = Trypsin, `2` = Chymotrypsin, `3` = Lys-C, `4` = Lys-N, `5` = Glu-C, `6` = Arg-C, `7` = Asp-N, `8` = alphaLP, `9` = No cleavage, `10` = TrypPlusC. |
+| `-ntt` | NTT | `2` | Number of tolerable termini: `0` = non-specific, `1` = semi-specific, `2` = fully specific. |
+| `-maxMissedCleavages` | MaxMissedCleavages | `-1` | Maximum missed cleavages (`-1` = no limit). |
+
+### Peptide Filtering
+
+| Flag | Name | Default | Description |
+|------|------|---------|-------------|
+| `-minLength` | MinPepLength | `6` | Minimum peptide length to consider. |
+| `-maxLength` | MaxPepLength | `40` | Maximum peptide length to consider. |
+| `-minCharge` | MinCharge | `2` | Minimum precursor charge (if not in spectrum file). |
+| `-maxCharge` | MaxCharge | `3` | Maximum precursor charge (if not in spectrum file). |
+| `-msLevel` | MSLevel | `2` | MS level(s) to search (e.g., `2` or `2,3` for MS2+MS3). |
+
+### Modifications and Protocol
+
+| Flag | Name | Default | Description |
+|------|------|---------|-------------|
+| `-mod` | ModificationFileName | — | Modification file path. If not specified, uses standard amino acids with fixed Carbamidomethyl C. |
+| `-numMods` | NumMods | `3` | Maximum number of dynamic (variable) modifications per peptide. |
+| `-protocol` | ProtocolID | `0` | `0` = Automatic, `1` = Phosphorylation, `2` = iTRAQ, `3` = iTRAQPhospho, `4` = TMT, `5` = Standard. |
+
+### Output and Performance
 
-Java Runtime 8 or higher (use 64-bit Java)\
-At least 2GB of memory (recommended to use 4GB); larger FASTA files require more memory
+| Flag | Name | Default | Description |
+|------|------|---------|-------------|
+| `-n` | NumMatchesPerSpec | `1` | Number of matches per spectrum to report. Values >1 may skew FDR. |
+| `-addFeatures` | AddFeatures | `0` | `0` = basic scores, `1` = additional features (enable for Percolator). |
+| `-thread` | NumThreads | All cores | Number of concurrent threads. |
+| `-tasks` | NumTasks | `0` | Override task count: `0` = auto, `>0` = exact count, `<0` = multiplier of threads. |
+| `-verbose` | Verbose | `0` | `0` = total progress only, `1` = per-thread progress. |
+| `-ccm` | ChargeCarrierMass | `1.00727649` | Mass of charge carrier (proton). |
 
-Releases after April 2019 support Java 11 and newer; older releases will not work with Java 11 or newer (at least for reading mzML files) due to the deprecation and removal of some built-in libraries.
+### Advanced Parameters
 
-Downloads / Updates
-======
-[![Github This Release](https://img.shields.io/github/downloads/MSGFPlus/msgfplus/total.svg)]() [![Github This Release](https://img.shields.io/github/downloads/MSGFPlus/msgfplus/latest/total.svg)]()
-* https://github.com/MSGFPlus/msgfplus/releases
+| Flag | Name | Default | Description |
+|------|------|---------|-------------|
+| `-minNumPeaks` | MinNumPeaksPerSpectrum | `10` | Minimum number of peaks per spectrum. |
+| `-iso` | NumIsoforms | `128` | Number of isoforms to consider per peptide. |
+| `-ignoreMetCleavage` | IgnoreMetCleavage | `0` | `0` = consider N-term Met cleavage, `1` = ignore. |
+| `-allowDenseCentroidedPeaks` | AllowDenseCentroidedPeaks | `0` | `0` = skip spectra failing density check, `1` = allow dense centroided spectra. |
 
-*Version number notes*
+## Modification File Format
+
+Modifications are specified in a text file passed via `-mod`. Each line defines a static or dynamic modification:
 
-As of [January 20, 2016 (commit 375d462)](https://github.com/MSGFPlus/msgfplus/commit/375d462e30cbe460b699091a7d6ba52bc192aba1) the version numbering scheme changed.
-Previously the version number was the SVN commit number; git does not have simple commit numbers, so MSGFPlus was changed to a date-based version numbering scheme.
+```
+# Format: Mass_or_Composition, Residues, ModType, Position, Name
 
-An example: v10282 became v2016.01.20
+# Static modifications
+StaticMod=C2H3N1O1, C, fix, any, Carbamidomethyl # Fixed alkylation
+StaticMod=229.1629, *, fix, N-term, TMT6plex
+StaticMod=229.1629, K, fix, any, TMT6plex
 
-Installation
-======
+# Dynamic modifications
+DynamicMod=O1, M, opt, any, Oxidation # Oxidized methionine
+DynamicMod=HO3P, STY, opt, any, Phospho # Phosphorylation
+DynamicMod=H-1N-1O1, NQ, opt, any, Deamidated # Deamidation
 
-Unzip MSGFPlus.zip\
-Place MSGFPlus.jar in any folder
+# Position options: any, N-term, C-term, Prot-N-term, Prot-C-term
+```
 
-Usage Information
-======
+See [`docs/examples/MSGFPlus_Params.txt`](docs/examples/MSGFPlus_Params.txt) for a complete example configuration file.
 
-Type `java -jar MSGFPlus.jar` for command line arguments.
+## Configuration File
 
-To convert an mzid output file into a tsv file, run `java -cp MSGFPlus.jar edu.ucsd.msjava.ui.MzIDToTsv`
+All parameters can also be specified in a configuration file (passed via `-conf`). Command-line options override configuration file settings. Use the parameter's config name (e.g., `PrecursorMassTolerance=20ppm`) instead of the flag form.
 
-Alternatively, use the Mzid-To-Tsv-Converter, which is a faster converter that supports larger result files.
-It is a C# application that works on Windows or on Linux using mono.
-Download the Mzid-To-Tsv-Converter from GitHub.
+## Building from Source
 
-For detailed documentation, see the "docs" subfolder, or visit:
-* [GitHub project help pages](https://msgfplus.github.io/msgfplus/)
-* [GitHub repo HTML help pages - same as above, but may have issues](https://htmlpreview.github.io/?https://github.com/MSGFPlus/msgfplus/blob/master/docs/index.html)
-* (previously at https://bix-lab.ucsd.edu/pages/viewpage.action?pageId=13533355)
+```bash
+# Requires Java 17+ and Maven
+mvn package
 
-Contact Information
-======
+# Run tests
+mvn test
 
-PNNL Proteomics [proteomics@pnnl.gov]\
-Sangtae Kim [sangtae.kim (at) gmail.com]
+# The JAR is produced at target/MSGFPlus.jar
+```
 
-Publications
-======
+## Publications
 
-"MS-GF+ makes progress towards a universal database search tool for proteomics,"\
-Sangtae Kim and Pavel A Pevzner,
-Nat Commun. 2014 Oct 31; 5:5277. doi: 10.1038/ncomms6277.\
-https://pubmed.ncbi.nlm.nih.gov/25358478/
+Kim S. and Pevzner P.A.,
+"MS-GF+ makes progress towards a universal database search tool for proteomics,"
+*Nat Commun.* 2014 Oct 31; 5:5277.
+[doi: 10.1038/ncomms6277](https://doi.org/10.1038/ncomms6277)
 
-"Spectral Probabilities and Generating Functions of Tandem Mass Spectra: A Strike against Decoy Databases",\
-Sangtae Kim, Nitin Gupta, and Pavel A Pevzner,
-J Proteome Res. 2008 Aug; 7(8):3354-63. doi: 10.1021/pr8001244.\
-https://pubmed.ncbi.nlm.nih.gov/18597511/
+Kim S., Gupta N., and Pevzner P.A.,
+"Spectral Probabilities and Generating Functions of Tandem Mass Spectra: A Strike against Decoy Databases,"
+*J Proteome Res.* 2008 Aug; 7(8):3354-63.
+[doi: 10.1021/pr8001244](https://doi.org/10.1021/pr8001244)
 
-Source
-======
+## Contact
 
-https://github.com/MSGFPlus/msgfplus
+PNNL Proteomics: proteomics@pnnl.gov
+Sangtae Kim: sangtae.kim (at) gmail.com