NeoFlow: a proteogenomics pipeline for neoantigen discovery

Overview

NeoFlow: a proteogenomics pipeline for neoantigen discovery

NeoFlow includes four modules:

  1. Variant annotation and customized database construction: neoflow_db.nf;

  2. Variant peptide identification: neoflow_msms.nf;

    • MS/MS searching. Three search engines are available: MS-GF+, X!Tandem and Comet;
    • FDR estimation: global FDR estimation;
    • Novel peptide validation by PepQuery;
    • RT based validation for novel peptide identifications using AutoRT: optional (GPU required).
  3. HLA typing: neoflow_hlatyping.nf;

  4. Neoantigen prediction: neoflow_neoantigen.nf.

NeoFlow supports both label-free and iTRAQ/TMT data.

Installation

  1. Download neoflow:
git clone https://github.com/bzhanglab/neoflow
  2. Install Docker (>=19.03).

  3. Install Nextflow. More information can be found on the Nextflow "Get started" page.

  4. Install ANNOVAR by following the instructions at http://annovar.openbioinformatics.org/en/latest/.

  5. Install netMHCpan 4.0 by following the instructions at http://www.cbs.dtu.dk/services/doc/netMHCpan-4.0.readme. Please set TMPDIR to /tmp in the file netMHCpan-4.0/netMHCpan, as shown below:

# determine where to store temporary files (must be writable to all users)

if ( ${?TMPDIR} == 0 ) then
        setenv  TMPDIR  /tmp
endif
  6. Install nvidia-docker (>=2.2.2) for AutoRT by following the instructions at https://github.com/NVIDIA/nvidia-docker. This step is optional; it is required only for RT-based validation of novel peptide identifications with AutoRT.

All other tools used by NeoFlow have been dockerized and will be installed automatically the first time NeoFlow is run on a computer.
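Before running the pipeline, it can help to confirm that the manually installed command-line tools are reachable. The following is a minimal sketch, assuming that git, docker and nextflow are on PATH (the usage examples below also call ./nextflow from the working directory, which this check would not see) and that the nvidia-docker package provides an nvidia-docker executable. ANNOVAR and netMHCpan are passed to NeoFlow as folders, so they are not checked here. The function names are hypothetical.

```python
import shutil

# Executables the installation steps expect to be available.
# Assumption: they are installed on PATH rather than in the working directory.
REQUIRED = ["git", "docker", "nextflow"]
# Only needed for the optional AutoRT RT-based validation.
OPTIONAL = ["nvidia-docker"]


def check_tools(names):
    """Map each executable name to its resolved path, or None if not found."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the names that cannot be found on PATH."""
    return [name for name, path in check_tools(names).items() if path is None]


if __name__ == "__main__":
    for name in missing_tools(REQUIRED):
        print(f"missing required tool: {name}")
    for name in missing_tools(OPTIONAL):
        print(f"missing optional tool: {name}")
```

The script prints nothing when every tool is found.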

Usage

1. Variant annotation and customized database construction

 $ nextflow run neoflow_db.nf --help
N E X T F L O W  ~  version 19.10.0
Launching `neoflow_db.nf` [irreverent_faggin] - revision: 741bf1a931
=========================================
neoflow => variant annotation and customized database construction
=========================================
Usage:
nextflow run neoflow_db.nf
Arguments:
  --vcf_file              A txt file contains VCF file(s)
  --annovar_dir           ANNOVAR folder
  --protocol              The parameter of "protocol" for ANNOVAR, default is "refGene"
  --ref_dir               ANNOVAR annotation data folder
  --ref_ver               The genome version, hg19 or hg38, default is "hg19"
  --out_dir               Output folder, default is "./output"
  --cpu                   The number of CPUs
  --help                  Print help message

The input for --vcf_file is a tab-delimited text file that lists the paths of the variant file(s). Each variant file can be in VCF format or in the simple text-based ANNOVAR input format. The layout of the input file is shown below:

experiment	sample	file	file_type
TMT01	T1	T1_somatic.vcf;T1_rna.vcf	somatic;rna
TMT01	T2	T2_somatic.vcf;T2_rna.vcf	somatic;rna
TMT02	T3	T3_somatic.vcf;T3_rna.vcf	somatic;rna
TMT02	T4	T4_somatic.vcf;T4_rna.vcf	somatic;rna

The experiment column holds the name of the label-free, TMT or iTRAQ experiment, and the sample column holds the sample name. For iTRAQ or TMT data, samples from the same iTRAQ or TMT experiment should share the same experiment name. For label-free data, each sample should have its own experiment name. All variant files for the same sample (for example, a somatic variant VCF file and a VCF file of variants called from RNA-Seq data) should be listed in the same row in the file column, separated by ";". The file_type column gives the variant type of each VCF file in the file column, in the same order. Please note that all variant files should be located under the folder where you run neoflow. We recommend providing the absolute path of each variant file in the input file for --vcf_file.
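The layout rules above can be checked before launching the pipeline. The following is a minimal sketch of such a check, not part of NeoFlow: the function name check_vcf_table is hypothetical, and it only verifies the header, the column count, that file and file_type list the same number of ";"-separated entries, and that each sample occupies a single row.

```python
import csv
import io

EXPECTED_HEADER = ["experiment", "sample", "file", "file_type"]


def check_vcf_table(text):
    """Return a list of problems found in a --vcf_file style table (empty if none)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["header must be: " + " ".join(EXPECTED_HEADER)]

    problems = []
    seen_samples = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != 4:
            problems.append(f"line {line_no}: expected 4 columns, got {len(row)}")
            continue
        _experiment, sample, files, file_types = row
        # Each file needs exactly one matching entry in file_type.
        if len(files.split(";")) != len(file_types.split(";")):
            problems.append(f"line {line_no}: file and file_type counts differ")
        # All variant files of a sample belong in a single row.
        if sample in seen_samples:
            problems.append(f"line {line_no}: sample {sample} appears more than once")
        seen_samples.add(sample)
    return problems


if __name__ == "__main__":
    example = (
        "experiment\tsample\tfile\tfile_type\n"
        "TMT01\tT1\tT1_somatic.vcf;T1_rna.vcf\tsomatic;rna\n"
        "TMT01\tT2\tT2_somatic.vcf;T2_rna.vcf\tsomatic;rna\n"
    )
    print(check_vcf_table(example) or "no problems found")
```

To check a real file, pass the contents of the file given to --vcf_file, for example `check_vcf_table(open(path).read())`.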

The ANNOVAR annotation data (--ref_dir) can be downloaded by following the instructions at http://annovar.openbioinformatics.org/en/latest/user-guide/download/.

The output files of neoflow_db.nf include a customized protein database in FASTA format for each experiment and variant annotation result files for each sample.

Example

nextflow run neoflow_db.nf --ref_dir /data/tools/annovar/humandb_hg19/ \
                           --vcf_file example_data/test_vcf_files.tsv \
                           --annovar_dir /data/tools/annovar/ \
                           --ref_ver hg19 \
                           --out_dir output

Please update the values of --ref_dir and --annovar_dir before running the above example. The input file for --vcf_file can be downloaded from the example data prepared for testing. After downloading the example data, unzip it; all of the test data are then available in the example_data folder.

The running time of the above example is less than 5 minutes on a Linux server with 40 cores.

2. Variant peptide identification

Please note that the customized database generated in the first step will be used in this step.

 $ ./nextflow run neoflow_msms.nf --help
N E X T F L O W  ~  version 19.10.0
Launching `neoflow_msms.nf` [drunk_nobel] - revision: 6d58fb19bd
=========================================
neoflow => Variant peptide identification
=========================================
Usage:
nextflow run neoflow-msms.nf
MS/MS searching arguments:
  --db                        The customized protein database (target + decoy sequences) in FASTA format which is generated by neoflow_db.nf
  --ms                        MS/MS data in MGF format
  --msms_para_file            Parameter file for MS/MS searching
  --out_dir                   Output folder, default is "./"
  --prefix                    The prefix of output files
  --search_engine             The search engine used for MS/MS searching, comet=Comet, msgf=MS-GF+ or xtandem=X!Tandem

PepQuery arguments:
  --pv_enzyme                 Enzyme used for protein digestion. 0:Non enzyme, 1:Trypsin (default), 2:Trypsin (no P rule), 3:Arg-C, 4:Arg-C (no P rule), 5:Arg-N, 6:Glu-C, 7:Lys-C
  --pv_c                      The max missed cleavages, default is 2
  --pv_tol                    Precursor ion m/z tolerance, default is 10
  --pv_tolu                   The unit of --tol, ppm or Da. Default is ppm
  --pv_itol                   The error window for fragment ion, default is 0.5
  --pv_fixmod                 Fixed modification. The format is like : 1,2,3. Different modification is represented by different number
  --pv_varmod                 Variable modification. The format is the same with --fixMod;
  --pv_refdb                  Reference protein database

AutoRT parameters:
  --rt_validation             Perform RT based validation
  
  --help                      Print help message

The output files of neoflow_msms.nf include the raw identification files from MS/MS searching, FDR estimation result files at both the PSM and peptide levels, and PepQuery validation result files.
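For readers unfamiliar with FDR estimation against a target + decoy database, the idea can be illustrated as follows. This is a generic target-decoy sketch, not NeoFlow's implementation: it assumes that a higher score means a better match, estimates the FDR at each rank as the number of decoy hits divided by the number of target hits at or above that rank, and uses a 1% cutoff purely as an illustrative default. The function names are hypothetical.

```python
def global_fdr(psms):
    """Annotate PSMs with a running FDR estimate.

    psms: iterable of (score, is_decoy) pairs; higher score = better match.
    Returns (score, is_decoy, fdr) tuples sorted by descending score, where
    fdr is decoys / targets counted over all PSMs at or above that rank.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    annotated = []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / targets if targets else 1.0
        annotated.append((score, is_decoy, fdr))
    return annotated


def accepted_targets(psms, max_fdr=0.01):
    """Return scores of target PSMs ranked at or above the lowest rank whose FDR <= max_fdr."""
    annotated = global_fdr(psms)
    cutoff = -1
    for index, (_score, _is_decoy, fdr) in enumerate(annotated):
        if fdr <= max_fdr:
            cutoff = index
    return [score for score, is_decoy, _fdr in annotated[: cutoff + 1] if not is_decoy]


if __name__ == "__main__":
    # Toy data: (score, is_decoy)
    toy = [(10, False), (9, False), (8, True), (7, False), (6, False)]
    print(accepted_targets(toy, max_fdr=0.01))
```

The same procedure can be applied at the peptide level by keeping only the best-scoring PSM per peptide before ranking.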

Example

nextflow run neoflow_msms.nf --ms example_data/mgf/ \
               --msms_para_file example_data/comet_parameter.txt \
               --search_engine comet \
               --db output/customized_database/neoflow_crc_target_decoy.fasta \
               --out_dir output \
               --pv_refdb output/customized_database/ref.fasta \
               --pv_tol 20 \
               --pv_itol 0.05

The input files for --ms and --msms_para_file can be downloaded from the example data prepared for testing.

The variant peptide identification result is written to output/novel_peptide_identification/novel_peptides_psm_pepquery.tsv.

The running time of the above example is less than 15 minutes on a Linux server with 40 cores.

3. HLA typing

 $ ./nextflow run neoflow_hlatyping.nf --help
N E X T F L O W  ~  version 19.10.0
Launching `neoflow_hlatyping.nf` [spontaneous_hawking] - revision: 5fd970e701
=========================================
neoflow => HLA typing
=========================================
Usage:
nextflow run neoflow_hlatyping.nf
Arguments:
  --reads                     Reads data in fastq.gz or fastq format. For example, "*_{1,2}.fq.gz"
  --hla_ref_dir               HLA reference folder
  --seqtype                   Read type, dna or rna. Default is dna.
  --singleEnd                 Single end or not, default is false (pair end reads)
  --cpu                       The number of CPUs, default is 6.
  --out_dir                   Output folder, default is "./"
  --help                      Print help message

The output of neoflow_hlatyping.nf is a text file containing the HLA alleles of a sample. This file is generated by OptiType.

Example

nextflow run neoflow_hlatyping.nf --hla_ref_dir example_data/hla_reference \
                  --reads "example_data/dna/*_{1,2}.fastq.gz" \
                  --out_dir output/ \
                  --cpu 40

The input files for --hla_ref_dir and --reads can be downloaded from the example data prepared for testing.

The HLA typing result is written to output/hla_type/sample1/sample1_result.tsv.

The running time of the above example is less than 10 minutes on a Linux server with 40 cores.
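To inspect the typing result programmatically, the alleles can be read from the result file. The following is a minimal sketch, assuming the file follows the usual OptiType layout: a tab-delimited table with columns A1, A2, B1, B2, C1 and C2 and a single result row. The allele values in the example are made up, and the "HLA-" prefix is added only to match the HLA_type notation used in the neoantigen prediction table. The function name is hypothetical.

```python
import csv
import io

ALLELE_COLUMNS = ("A1", "A2", "B1", "B2", "C1", "C2")


def read_optitype_alleles(text):
    """Return the unique HLA alleles from the first row of an OptiType-style table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    row = next(reader)
    alleles = []
    for column in ALLELE_COLUMNS:
        allele = (row.get(column) or "").strip()
        if not allele:
            continue  # allele could not be typed
        name = "HLA-" + allele
        if name not in alleles:  # homozygous loci are reported twice
            alleles.append(name)
    return alleles


if __name__ == "__main__":
    # Made-up example content in the assumed layout.
    example = (
        "\tA1\tA2\tB1\tB2\tC1\tC2\tReads\tObjective\n"
        "0\tA*01:01\tA*01:01\tB*08:01\tB*57:01\tC*07:01\tC*06:02\t1127.0\t1106.71\n"
    )
    print(read_optitype_alleles(example))
```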

4. Neoantigen prediction

Please note that the results generated in steps 1-3 will be used in this step.

 $ ./nextflow run neoflow_neoantigen.nf --help
N E X T F L O W  ~  version 19.10.0
Launching `neoflow_neoantigen.nf` [mighty_roentgen] - revision: e4261baca3
=========================================
neoflow => Neoantigen prediction
=========================================
Usage:
nextflow run neoflow_neoantigen.nf
Arguments:
  --var_db                  Variant (somatic) database in fasta format generated by neoflow_db.nf
  --var_info_file           Variant (somatic) information in txt format generated by neoflow_db.nf
  --ref_db                  Reference (known) protein database
  --hla_type                HLA typing result in txt format generated by Optitype
  --netmhcpan_dir           NetMHCpan 4.0 folder
  --var_pep_file            Variant peptide identification result generated by neoflow_msms.nf, optional.
  --var_pep_info            Variant information in txt format for customized database used for variant peptide identification
  --prefix                  The prefix of output files
  --out_dir                 Output directory
  --cpu                     The number of CPUs
  --help                    Print help message

The output of neoflow_neoantigen.nf is a TSV file containing the neoantigen prediction results, as shown below:

Variant_ID	Chr	Start	End	Ref	Alt	Variant_Type	Variant_Function	Gene	mRNA	Neoepitope	Variant_Start	Variant_End	AA_before	AA_after	HLA_type	netMHCpan_binding_affinity_nM	netMHCpan_precentail_rank	protein_var_evidence_pep
VAR|NM_002536|10054	chrX	48418659	48418659	G	A	nonsynonymous SNV	protein-altering	TBC1D25	NM_002536	TGFGGHRG	1	1	A	T	HLA-A*01:01	44216.6	88.5537	-
VAR|NM_002536|10054	chrX	48418659	48418659	G	A	nonsynonymous SNV	protein-altering	TBC1D25	NM_002536	TGFGGHRG	1	1	A	T	HLA-C*07:01	43330	73.7774	-
VAR|NM_002536|10054	chrX	48418659	48418659	G	A	nonsynonymous SNV	protein-altering	TBC1D25	NM_002536	TGFGGHRG	1	1	A	T	HLA-B*08:01	35925.8	70.8561	-
VAR|NM_001348265|10055	chrX	48418659	48418659	G	A	nonsynonymous SNV	protein-altering	TBC1D25	NM_001348265	TGFGGHRG	1	1	A	T	HLA-A*01:01	44216.6	88.5537	-
VAR|NM_001348265|10055	chrX	48418659	48418659	G	A	nonsynonymous SNV	protein-altering	TBC1D25	NM_001348265	TGFGGHRG	1	1	A	T	HLA-C*07:01	43330	73.7774	-

Column description for the above table:

Variant_ID:	variant ID defined by neoflow
Chr:	variant chromosome
Start:	start position on genome
End:	end position on genome
Ref:	reference base
Alt:	alternative base
Variant_Type:	variant type annotated by ANNOVAR
Variant_Function:	variant function annotated by ANNOVAR
Gene:	gene ID
mRNA:	mRNA ID
Neoepitope:	neoepitope peptide
Variant_Start:	variant start position on neoepitope peptide
Variant_End:	variant end position on neoepitope peptide
AA_before:	reference amino acid
AA_after:	alternative amino acid
HLA_type:	HLA type
netMHCpan_binding_affinity_nM:	MHC-peptide binding affinity (nM) from NetMHCpan 4.0. The lower the value, the stronger the binding between the MHC molecule and the neoepitope peptide.
netMHCpan_precentail_rank:	percentile rank of the MHC-peptide binding affinity from NetMHCpan 4.0
protein_var_evidence_pep:	variant peptide. "-" means no variant peptide identified covers the mutation site.
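Because lower affinity values mean stronger binding, a common follow-up is to keep only the rows below an affinity cutoff. The following is a minimal sketch, assuming the result file is tab-delimited with the column names listed above. The 500 nM default is a frequently used convention rather than a NeoFlow setting, and the function name is hypothetical.

```python
import csv
import io


def strong_binders(text, max_affinity_nm=500.0):
    """Return (neoepitope, HLA type, affinity in nM) for rows at or below the cutoff."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    hits = []
    for row in reader:
        affinity = float(row["netMHCpan_binding_affinity_nM"])
        if affinity <= max_affinity_nm:
            hits.append((row["Neoepitope"], row["HLA_type"], affinity))
    return hits


if __name__ == "__main__":
    # Reduced example with only the columns the function reads;
    # the values are taken from the table shown above.
    example = (
        "Neoepitope\tHLA_type\tnetMHCpan_binding_affinity_nM\n"
        "TGFGGHRG\tHLA-A*01:01\t44216.6\n"
        "TGFGGHRG\tHLA-B*08:01\t35925.8\n"
    )
    print(strong_binders(example))                          # none pass the 500 nM default
    print(strong_binders(example, max_affinity_nm=40000))   # deliberately loose cutoff
```

None of the example rows shown above would pass a 500 nM cutoff, since their predicted affinities are all above 35,000 nM.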

Example

nextflow run neoflow_neoantigen.nf --prefix sample1 \
                   --hla_type output/hla_type/sample1/sample1_result.tsv \
                   --var_db output/customized_database/sample1-somatic-var.fasta \
                   --var_info_file output/customized_database/sample1-somatic-varInfo.txt \
                   --out_dir output/ \
                   --netmhcpan_dir /data/tools/netMHCpan-4.0/ \
                   --cpu 40 \
                   --ref_db output/customized_database/ref.fasta \
                   --var_pep_file output/novel_peptide_identification/novel_peptides_psm_pepquery.tsv \
                   --var_pep_info output/customized_database/neoflow_crc_anno-varInfo.txt

Please update the value of --netmhcpan_dir before running the above example.

The neoantigen prediction result is written to output/neoantigen_prediction/sample1_neoepitope_filtered_by_reference_add_variant_protein_evidence.tsv.

The running time of the above example is less than 30 minutes on a Linux server with 40 cores.
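Since step 4 consumes the outputs of steps 1-3, the four example commands can be chained in a small driver script. The following is a minimal sketch that reuses only the flags and paths from the examples above. It assumes nextflow is on PATH and that the script is started from the neoflow folder; the function names build_commands and run_all are hypothetical.

```python
import subprocess


def build_commands(annovar_dir, ref_dir, netmhcpan_dir):
    """Return the four example invocations in dependency order."""
    db = "output/customized_database"
    return [
        ["nextflow", "run", "neoflow_db.nf",
         "--ref_dir", ref_dir,
         "--vcf_file", "example_data/test_vcf_files.tsv",
         "--annovar_dir", annovar_dir,
         "--ref_ver", "hg19",
         "--out_dir", "output"],
        ["nextflow", "run", "neoflow_msms.nf",
         "--ms", "example_data/mgf/",
         "--msms_para_file", "example_data/comet_parameter.txt",
         "--search_engine", "comet",
         "--db", f"{db}/neoflow_crc_target_decoy.fasta",
         "--out_dir", "output",
         "--pv_refdb", f"{db}/ref.fasta",
         "--pv_tol", "20",
         "--pv_itol", "0.05"],
        ["nextflow", "run", "neoflow_hlatyping.nf",
         "--hla_ref_dir", "example_data/hla_reference",
         # Passed as one literal argument, like the quoted pattern in the example.
         "--reads", "example_data/dna/*_{1,2}.fastq.gz",
         "--out_dir", "output/",
         "--cpu", "40"],
        ["nextflow", "run", "neoflow_neoantigen.nf",
         "--prefix", "sample1",
         "--hla_type", "output/hla_type/sample1/sample1_result.tsv",
         "--var_db", f"{db}/sample1-somatic-var.fasta",
         "--var_info_file", f"{db}/sample1-somatic-varInfo.txt",
         "--out_dir", "output/",
         "--netmhcpan_dir", netmhcpan_dir,
         "--cpu", "40",
         "--ref_db", f"{db}/ref.fasta",
         "--var_pep_file",
         "output/novel_peptide_identification/novel_peptides_psm_pepquery.tsv",
         "--var_pep_info", f"{db}/neoflow_crc_anno-varInfo.txt"],
    ]


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_all(build_commands("/data/tools/annovar/", "/data/tools/annovar/humandb_hg19/", "/data/tools/netMHCpan-4.0/"))` would run the four steps with the folder locations used in the examples; adjust these to the local installation.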

Example data

The test data used for the above examples can be downloaded by clicking the test data link.

How to cite:

Wen, B., Li, K., Zhang, Y. et al. Cancer neoantigen prioritization through sensitive and reliable proteogenomics analysis. Nature Communications 11, 1759 (2020). https://doi.org/10.1038/s41467-020-15456-w