Young edited this page Feb 23, 2024 · 18 revisions

Running Grandeur

Grandeur is a Nextflow workflow for working with bacterial isolates in a public health setting.

Quickstart

See USAGE for more information and Examples for worked examples.

nextflow run UPHL-BioNGS/Grandeur -profile docker --sample_sheet sample_sheet.csv --outdir /path/to/where/results/will/be/copied

The Problem

All "good" bioinformatic tools and workflows are trying to solve a problem. At UPHL, our problem was that no existing workflow met our sequencing analysis needs for bacterial isolates.

We needed a bioinformatic workflow to replace the bench experiments for:

  • E. coli O and H characterization
  • Shigella CadA and IpaH characterization
  • Vibrio speciation
  • Salmonella serotyping

We also needed something to assist our local epidemiologists in outbreak investigations:

  • Species-agnostic core genome alignment and phylogenetic trees
  • SNP matrices
  • AMR gene identification

Grandeur is species-agnostic at its core, although certain organisms undergo some species-specific processes. Future directions include speciation of unknown isolates. Initial testing looked promising, but this use has not been tested enough to make it widely available.

Grandeur includes a de novo assembly workflow, but can also run on contig/fasta files generated from other workflows. Our most common files are those generated from PHOENIX, CDC's ARLN workflow, and DONUT FALLS, a UPHL-generated nanopore sequencing workflow.

More information about using Grandeur and what its subworkflows do can be found in this Wiki.

If you are running into bugs or other issues, please post in the issues tracker.

All parameters and their default values

// directory for results
params.outdir               = "grandeur"

// input files
params.reads                = ""
params.fastas               = ""
params.sample_sheet         = ""
params.fasta_list           = ""

// external files
params.kraken2_db           = ""
params.blast_db             = ""
params.mash_db              = ""
params.fastani_ref          = ""
params.fastani_ref_list     = ""
params.genome_sizes         = workflow.projectDir + "/assets/genome_sizes.json"

// for downloading from databases
params.sra_accessions       = []

// thresholds and other params
params.minimum_reads        = 10000
params.datasets_max_genomes = 5
params.mash_max_hits        = 25
params.min_core_genes       = 1500
params.iqtree2_outgroup     = ""

// subworkflow flags
params.current_datasets     = false
params.skip_extras          = false
params.exclude_top_hit      = false
params.msa                  = false

Further information on each of these params and how it affects the workflow is on other pages of the wiki.
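As a minimal sketch of how the defaults above might be overridden, the snippet below writes a small custom config file and shows, in a comment, how it could be passed to Nextflow with the standard `-c` option. The file name `grandeur_custom.config` and every value in it are placeholders chosen for illustration, not recommended settings.

```shell
# Write a small config that overrides a few of the defaults listed above.
# All values here are hypothetical placeholders.
cat > grandeur_custom.config <<'EOF'
params.outdir        = "grandeur_run1"
params.sample_sheet  = "sample_sheet.csv"
params.minimum_reads = 20000
params.msa           = true
EOF

# The config could then be supplied with Nextflow's -c option (not executed here):
# nextflow run UPHL-BioNGS/Grandeur -profile docker -c grandeur_custom.config
```

Nextflow also accepts params on the command line with a double dash, so the same override could likely be written as `--minimum_reads 20000` in the run command instead of in a config file.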

Final file structure

By default, results are written to a directory named 'grandeur' in the directory where the command was run. This can be changed with 'params.outdir' in a config file or '--outdir' on the command line.

grandeur/
├── grandeur_summary.tsv
├── grandeur_summary.txt                                                           # a table with a summary from all the serotyping and QC tools
├── aligned
│   ├── sample.sorted.bam
│   └── sample.sorted.bam.csi
├── bbduk
│   ├── sample.matched_phix.fq
│   ├── sample.phix.stats.txt
│   ├── sample_rmphix_R1.fastq.gz
│   └── sample_rmphix_R2.fastq.gz
├── blastn
│   └── sample.tsv
├── blobtools
│   ├── sample.sample.sorted.bam.cov
│   ├── sample.blobDB.json
│   ├── sample.blobDB.json.bestsum.species.p8.span.100.blobplot.bam0.png
│   ├── sample.blobDB.json.bestsum.species.p8.span.100.blobplot.read_cov.bam0.png
│   ├── sample.blobDB.json.bestsum.species.p8.span.100.blobplot.stats.txt          # Genus and species of the reads
│   └── sample.blobDB.table.txt
├── contigs
│   └── sample_contigs.fa                                                          # fasta file of contigs
├── datasets
│   ├── Organism_specific_genomes.csv
│   └── datasets_summary.csv
├── emmtyper
├── fastani
│   ├── fastani.out
│   └── sample.txt
├── fastani
│   ├── fastani_summary.csv
│   ├── sample.txt
│   └── sample.txt.matrix
├── fastp
│   ├── sample_fastp.html
│   ├── sample_fastp.json
│   ├── sample_fastp_R1.fastq.gz
│   └── sample_fastp_R2.fastq.gz
├── fastqc
│   ├── sample_fastqc.html
│   ├── sample_fastqc.zip
│   ├── sample_fastqc.html
│   └── sample_fastqc.zip
├── flag
│   ├── flag_summary.csv
│   └── sample_flag.csv
├── gff
│   └── sample.gff                                                                 # gff file created by prokka
├── grandeur_results.tsv                                                           # summary file
├── iqtree2
│   ├── iqtree.ckp.gz
│   ├── iqtree.contree                                                             # treefile without node values
│   ├── iqtree.iqtree
│   ├── iqtree.log
│   ├── iqtree.splits.nex
│   └── iqtree.treefile
├── kleborate
│   ├── sample_results.txt
│   └── kleborate_results.txt                                                      # klebsiella hypervirulence scoring
├── kraken2
│   └── sample_kraken2_report.txt
├── legsta
├── logs
├── mash
│   ├── sample_mashdist.txt                                                        # mash distances
│   └── sample.msh
├── mlst
│   ├── sample_mlst.txt
│   └── mlst_result.tsv                                                            # mlst of organism (if found)
├── multiqc
│   ├── multiqc_data
│   │   └── *
│   └── multiqc_report.html
├── ncbi-AMRFinderplus
│   ├── amrfinderplus.txt
│   └── sample_amrfinder_plus.txt
├── pbptyper
├── plasmidfinder
│   └── sample
│       ├── data.json
│       └── tmp
│           ├── out_enterobacteriaceae.xml
│           ├── out_Inc18.xml
│           ├── out_NT_Rep.xml
│           ├── out_Rep1.xml
│           ├── out_Rep2.xml
│           ├── out_Rep3.xml
│           ├── out_RepA_N.xml
│           ├── out_RepL.xml
│           └── out_Rep_trans.xml
├── prokka                                                                         # optional, but may save time by pre-generating gff files
│   └── sample
│       ├── sample.err
│       ├── sample.faa
│       ├── sample.ffn
│       ├── sample.fna
│       ├── sample.fsa
│       ├── sample.gbk
│       ├── sample.gff                                                             # annotated contig file that can be used via roary
│       ├── sample.log
│       ├── sample.sqn
│       ├── sample.tbl
│       ├── sample.tsv
│       └── sample.txt
├── quast
│   ├── sample
│   │   ├── basic_stats
│   │   │   ├── cumulative_plot.pdf
│   │   │   ├── GC_content_plot.pdf
│   │   │   ├── sample_GC_content_plot.pdf
│   │   │   └── Nx_plot.pdf
│   │   ├── icarus.html
│   │   ├── icarus_viewers
│   │   │   └── contig_size_viewer.html
│   │   ├── quast.log
│   │   ├── report.html
│   │   ├── report.pdf
│   │   ├── report.tex
│   │   ├── report.tsv
│   │   ├── report.txt
│   │   ├── transposed_report.tex
│   │   ├── transposed_report.tsv
│   │   └── transposed_report.txt
│   └── report.tsv                                                               # QC for contigs
├── roary
│   ├── accessory_binary_genes.fa
│   ├── accessory_binary_genes.fa.newick
│   ├── accessory_graph.dot
│   ├── accessory.header.embl
│   ├── accessory.tab
│   ├── blast_identity_frequency.Rtab
│   ├── clustered_proteins
│   ├── core_accessory_graph.dot
│   ├── core_accessory.header.embl
│   ├── core_accessory.tab
│   ├── core_alignment_header.embl
│   ├── core_gene_alignment.aln                                                 # core genome alignment
│   ├── fixed_input_files
│   │   └── sample.gff
│   ├── gene_presence_absence.csv
│   ├── gene_presence_absence.Rtab
│   ├── number_of_conserved_genes.Rtab
│   ├── number_of_genes_in_pan_genome.Rtab
│   ├── number_of_new_genes.Rtab
│   ├── number_of_unique_genes.Rtab
│   ├── pan_genome_reference.fa
│   └── summary_statistics.txt                                                 # important file with the number of genes involved in core genome
├── seqsero2
│   ├── sample
│   │   ├── sample_H_and_O_and_specific_genes.fasta_mem.fasta
│   │   ├── blasted_output.xml
│   │   ├── data_log.txt
│   │   ├── Extracted_antigen_alleles.fasta
│   │   ├── SeqSero_log.txt
│   │   ├── SeqSero_result.tsv
│   │   └── SeqSero_result.txt
│   └── SeqSero_result.tsv                                                       # Salmonella serotypes
├── serotypefinder
│   └── sample
│       ├── data.json
│       ├── Hit_in_genome_seq.fsa
│       ├── results_tab.tsv                                                      # E. coli serotypes
│       ├── results.txt
│       ├── Serotype_allele_seq.fsa
│       └── tmp
│           ├── out_H_type.xml
│           └── out_O_type.xml
├── shigatyper                                                                     # Shigatyper serotypes
│   ├── shigatyper_results.txt
│   └── sample_shigatyper.tsv
├── size
│   ├── size_results.csv
│   └── sample_size.csv
├── spades
│   └── sample
│       ├── assembly_graph_after_simplification.gfa
│       ├── assembly_graph.fastg
│       ├── assembly_graph_with_scaffolds.gfa
│       ├── before_rr.fasta
│       ├── contigs.fasta
│       ├── contigs.paths
│       ├── dataset.info
│       ├── input_dataset.yaml
│       ├── K127
│       │   ├── assembly_graph_after_simplification.gfa
│       │   ├── assembly_graph.fastg
│       │   ├── assembly_graph_with_scaffolds.gfa
│       │   ├── before_rr.fasta
│       │   ├── configs
│       │   │   ├── careful_mda_mode.info
│       │   │   ├── careful_mode.info
│       │   │   ├── config.info
│       │   │   ├── construction.info
│       │   │   ├── detail_info_printer.info
│       │   │   ├── distance_estimation.info
│       │   │   ├── hmm_mode.info
│       │   │   ├── isolate_mode.info
│       │   │   ├── large_genome_mode.info
│       │   │   ├── mda_mode.info
│       │   │   ├── meta_mode.info
│       │   │   ├── metaplasmid_mode.info
│       │   │   ├── metaviral_mode.info
│       │   │   ├── moleculo_mode.info
│       │   │   ├── pe_params.info
│       │   │   ├── plasmid_mode.info
│       │   │   ├── rna_mode.info
│       │   │   ├── rnaviral_mode.info
│       │   │   ├── simplification.info
│       │   │   ├── toy.info
│       │   │   └── tsa.info
│       │   ├── final_contigs.fasta
│       │   ├── final_contigs.paths
│       │   ├── final.lib_data
│       │   ├── path_extend
│       │   ├── scaffolds.fasta
│       │   └── scaffolds.paths
│       ├── K21
│       │   ├── configs
│       │   │   ├── careful_mda_mode.info
│       │   │   ├── careful_mode.info
│       │   │   ├── config.info
│       │   │   ├── construction.info
│       │   │   ├── detail_info_printer.info
│       │   │   ├── distance_estimation.info
│       │   │   ├── hmm_mode.info
│       │   │   ├── isolate_mode.info
│       │   │   ├── large_genome_mode.info
│       │   │   ├── mda_mode.info
│       │   │   ├── meta_mode.info
│       │   │   ├── metaplasmid_mode.info
│       │   │   ├── metaviral_mode.info
│       │   │   ├── moleculo_mode.info
│       │   │   ├── pe_params.info
│       │   │   ├── plasmid_mode.info
│       │   │   ├── rna_mode.info
│       │   │   ├── rnaviral_mode.info
│       │   │   ├── simplification.info
│       │   │   ├── toy.info
│       │   │   └── tsa.info
│       │   ├── final.lib_data
│       │   └── simplified_contigs
│       │       ├── contigs_info
│       │       ├── contigs.off
│       │       └── contigs.seq
│       ├── K33
│       │   ├── configs
│       │   │   ├── careful_mda_mode.info
│       │   │   ├── careful_mode.info
│       │   │   ├── config.info
│       │   │   ├── construction.info
│       │   │   ├── detail_info_printer.info
│       │   │   ├── distance_estimation.info
│       │   │   ├── hmm_mode.info
│       │   │   ├── isolate_mode.info
│       │   │   ├── large_genome_mode.info
│       │   │   ├── mda_mode.info
│       │   │   ├── meta_mode.info
│       │   │   ├── metaplasmid_mode.info
│       │   │   ├── metaviral_mode.info
│       │   │   ├── moleculo_mode.info
│       │   │   ├── pe_params.info
│       │   │   ├── plasmid_mode.info
│       │   │   ├── rna_mode.info
│       │   │   ├── rnaviral_mode.info
│       │   │   ├── simplification.info
│       │   │   ├── toy.info
│       │   │   └── tsa.info
│       │   ├── final.lib_data
│       │   └── simplified_contigs
│       │       ├── contigs_info
│       │       ├── contigs.off
│       │       └── contigs.seq
│       ├── K55
│       │   ├── configs
│       │   │   ├── careful_mda_mode.info
│       │   │   ├── careful_mode.info
│       │   │   ├── config.info
│       │   │   ├── construction.info
│       │   │   ├── detail_info_printer.info
│       │   │   ├── distance_estimation.info
│       │   │   ├── hmm_mode.info
│       │   │   ├── isolate_mode.info
│       │   │   ├── large_genome_mode.info
│       │   │   ├── mda_mode.info
│       │   │   ├── meta_mode.info
│       │   │   ├── metaplasmid_mode.info
│       │   │   ├── metaviral_mode.info
│       │   │   ├── moleculo_mode.info
│       │   │   ├── pe_params.info
│       │   │   ├── plasmid_mode.info
│       │   │   ├── rna_mode.info
│       │   │   ├── rnaviral_mode.info
│       │   │   ├── simplification.info
│       │   │   ├── toy.info
│       │   │   └── tsa.info
│       │   ├── final.lib_data
│       │   └── simplified_contigs
│       │       ├── contigs_info
│       │       ├── contigs.off
│       │       └── contigs.seq
│       ├── K77
│       │   ├── configs
│       │   │   ├── careful_mda_mode.info
│       │   │   ├── careful_mode.info
│       │   │   ├── config.info
│       │   │   ├── construction.info
│       │   │   ├── detail_info_printer.info
│       │   │   ├── distance_estimation.info
│       │   │   ├── hmm_mode.info
│       │   │   ├── isolate_mode.info
│       │   │   ├── large_genome_mode.info
│       │   │   ├── mda_mode.info
│       │   │   ├── meta_mode.info
│       │   │   ├── metaplasmid_mode.info
│       │   │   ├── metaviral_mode.info
│       │   │   ├── moleculo_mode.info
│       │   │   ├── pe_params.info
│       │   │   ├── plasmid_mode.info
│       │   │   ├── rna_mode.info
│       │   │   ├── rnaviral_mode.info
│       │   │   ├── simplification.info
│       │   │   ├── toy.info
│       │   │   └── tsa.info
│       │   ├── final.lib_data
│       │   └── simplified_contigs
│       │       ├── contigs_info
│       │       ├── contigs.off
│       │       └── contigs.seq
│       ├── K99
│       │   ├── configs
│       │   │   ├── careful_mda_mode.info
│       │   │   ├── careful_mode.info
│       │   │   ├── config.info
│       │   │   ├── construction.info
│       │   │   ├── detail_info_printer.info
│       │   │   ├── distance_estimation.info
│       │   │   ├── hmm_mode.info
│       │   │   ├── isolate_mode.info
│       │   │   ├── large_genome_mode.info
│       │   │   ├── mda_mode.info
│       │   │   ├── meta_mode.info
│       │   │   ├── metaplasmid_mode.info
│       │   │   ├── metaviral_mode.info
│       │   │   ├── moleculo_mode.info
│       │   │   ├── pe_params.info
│       │   │   ├── plasmid_mode.info
│       │   │   ├── rna_mode.info
│       │   │   ├── rnaviral_mode.info
│       │   │   ├── simplification.info
│       │   │   ├── toy.info
│       │   │   └── tsa.info
│       │   ├── final.lib_data
│       │   └── simplified_contigs
│       │       ├── contigs_info
│       │       ├── contigs.off
│       │       └── contigs.seq
│       ├── misc
│       │   └── broken_scaffolds.fasta
│       ├── params.txt
│       ├── pipeline_state
│       │   ├── stage_0_before_start
│       │   ├── stage_10_bs
│       │   ├── stage_11_terminate
│       │   ├── stage_1_as_start
│       │   ├── stage_2_k21
│       │   ├── stage_3_k33
│       │   ├── stage_4_k55
│       │   ├── stage_5_k77
│       │   ├── stage_6_k99
│       │   ├── stage_7_k127
│       │   ├── stage_8_copy_files
│       │   └── stage_9_as_finish
│       ├── run_spades.sh
│       ├── run_spades.yaml
│       ├── scaffolds.fasta
│       ├── scaffolds.paths
│       ├── spades.log
│       └── tmp
├── snp-dists
│   ├── roary_metrics_mqc.csv
│   ├── SNP_matrix_mqc.png
│   ├── SNP_matrix.pdf
│   ├── SNP_matrix.png                      # image of SNP matrix
│   ├── snp_matrix.txt                      # SNP matrix counting the number of SNPs that each sample differs by
│   └── snp_matrix_with_qc.txt              # SNP matrix with QC information
└── summary
    ├── grandeur_extended_summary.tsv
    ├── grandeur_extended_summary.txt
    ├── input_files.csv
    └── sample_names.csv                

Example result files for each analysis, and why they matter, are in a separate section of the wiki labelled processes.
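After a run, it can be handy to confirm that a few of the key files from the tree above were actually produced. The helper below is a rough sketch, not part of Grandeur: the function name `check_outputs` and the choice of files to check are assumptions made for illustration. Some of these files will only exist if the relevant subworkflow ran.

```shell
# check_outputs OUTDIR
# Reports which of a few key result files exist and are non-empty.
# Hypothetical helper; the file list is taken from the directory tree above.
check_outputs() {
  local outdir="${1:-grandeur}"
  local missing=0
  local f
  for f in \
    grandeur_summary.tsv \
    roary/core_gene_alignment.aln \
    roary/summary_statistics.txt \
    iqtree2/iqtree.contree \
    snp-dists/snp_matrix.txt
  do
    if [ -s "$outdir/$f" ]; then
      echo "found    $outdir/$f"
    else
      echo "missing  $outdir/$f"
      missing=$((missing + 1))
    fi
  done
  # The exit status is the number of missing files (0 means all were found).
  return "$missing"
}
```

Calling `check_outputs /path/to/outdir` prints one line per file. With no argument it checks the default 'grandeur' directory.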
