Bactopia v2 Overview #233

Closed
80 tasks done
rpetit3 opened this issue Aug 13, 2021 · 26 comments
Labels
enhancement New feature or request help wanted Extra attention is needed v2.0.0 Bactopia v2

Comments

@rpetit3
Member

rpetit3 commented Aug 13, 2021

Bactopia v2 Overview

With tremendous effort by @Mxrcon and @abhi18av, the foundation for migrating Bactopia to DSL2 has been laid out. This transition represents the key milestone to push Bactopia to v2! (Super excited about this Davi and Abhinav!)

By switching to DSL2, the door for creating custom Bactopia workflows has been opened. For example, let's say you have some Staphylococcus aureus samples, and you want to run Bactopia and then the Bactopia Tool staph-typer. Instead of running these as two separate steps, with DSL2 we can create a sub-workflow (e.g. Staphopia) that will automatically run Bactopia and staph-typer. In other words, we can start creating organism-specific sub-workflows, as well as sub-workflows that only include certain steps such as assembly.

I think this is also a good time to start cleaning up some things and adding features that will make long-term maintenance more sustainable.

House Cleaning

These are to help reduce the burden required to maintain Bactopia long-term. They are really about standardizing things in such a way that we can automate them. For example, printing usage across each of the workflows can be configured through config files (e.g. the nf-core JSON schema). There are also a lot of shared functions for checking inputs, creating channels, etc. These duplications are no longer necessary in DSL2.
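
To illustrate the config-driven usage idea, here is a minimal Python sketch. It is not Bactopia's actual code (which is Nextflow/Groovy); the schema layout and the `render_usage` name are assumptions that only loosely follow the nf-core JSON schema style.

```python
# Hypothetical, trimmed-down schema: parameter groups with a type and help text.
SCHEMA = {
    "Required Parameters": {
        "fastqs": {"type": "string", "description": "A FOFN with sample names and paths"},
    },
    "Optional Parameters": {
        "coverage": {
            "type": "integer",
            "description": "Reduce samples to a given coverage",
            "default": 100,
        },
    },
}


def render_usage(schema):
    """Build a help message from the schema instead of hand-written usage text."""
    lines = []
    for group, params in schema.items():
        lines.append(group)
        for name, spec in params.items():
            default = f" [default: {spec['default']}]" if "default" in spec else ""
            lines.append(f"  --{name:<12} [{spec['type']}] {spec['description']}{default}")
        lines.append("")
    return "\n".join(lines).rstrip()


print(render_usage(SCHEMA))
```

Because the help text is derived from the schema, adding a parameter in one place would update the usage for every workflow that shares it.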

Additional Features

  • Support for Nanopore reads
  • Drop support for some uncompressed outputs (e.g. assemblies)
    • Defaults to compressed outputs, --skip_compression disables this feature
  • GenBank compatible assembly
    • Currently does not like gnl|
  • Outputs from the tutorial https://doi.org/10.6084/m9.figshare.17097156.v1

Implement pytest for testing

I'd like to create a suite of tests that are operated by pytest and pytest-workflows. The nf-core/modules team has a framework that can be extended to Bactopia.

  • Setup Test-Data repo - Done! https://github.com/bactopia/bactopia-tests
  • Setup walk through for testing - Done! https://github.com/bactopia/bactopia/tree/dsl2/tests
  • Add tests to GitHub Actions - Done! https://github.com/bactopia/bactopia/actions/runs/1256507737
  • Create Tests for Bactopia Modules
    • annotate_genome
    • antimicrobial_resistance
    • ariba_analysis
    • assemble_genome
    • assembly_qc
    • blast
    • call_variants
    • count_31mers (merged into minmer_sketch)
    • download_references (merged into call_variants)
    • estimate_genome_size (merged into gather_samples)
    • fastq_status (merged into gather_samples)
    • gather_samples
    • mapping_query
    • minmer_query
    • minmer_sketch
    • qc_reads
    • sequence_type
  • Create Tests for Bactopia Tool Modules
    • agrvate
    • bakta
    • ectyper
    • emmtyper
    • eggnog
    • fastani
    • hicap
    • ismapper
    • kleborate
    • lissero
    • mashtree
    • meningotype
    • ngmaster
    • pangenome
    • seqsero2
    • spatyper
    • staph-typer
    • staphopiasccmec
    • tbprofiler
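
As a rough idea of what a module test could check, here is a plain pytest sketch. pytest-workflow itself is configured through YAML, so this is only an illustration in Python; the `missing_outputs` helper and the file names are hypothetical and are not taken from the Bactopia test suite.

```python
from pathlib import Path


def missing_outputs(outdir, expected):
    """Return the expected relative paths that are absent or empty under outdir."""
    missing = []
    for rel in expected:
        path = Path(outdir) / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


def test_qc_reads_outputs(tmp_path):
    # Stand-in for a real pipeline run: create only one of the two expected files.
    (tmp_path / "qc_reads").mkdir()
    (tmp_path / "qc_reads" / "sample.fastq.gz").write_bytes(b"\x1f\x8b")
    expected = ["qc_reads/sample.fastq.gz", "qc_reads/sample-final.json"]
    assert missing_outputs(tmp_path, expected) == ["qc_reads/sample-final.json"]
```

In a real test the directory would come from running the module on the test data rather than from files created inside the test.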

Convert some processes to nf-core/modules

There are a few tools used by Bactopia that are the only tool in the process. Most of these tools are in the Bactopia Tools. I think it's best that these tools be transferred to nf-core/modules. Many of these will need to be added to nf-core, but they are in need of some bacterial genomic tool love, so it's ok!

Curated Datasets

I think one of the best features of Bactopia is the ability to include public datasets. This works great for general datasets, but organism-specific datasets are kind of lost. I think it would be great to start a set of curated datasets that users can add data to.

Here's an example of a curated Staphylococcus aureus Bactopia Dataset. This dataset can easily be imported and allow users to rapidly analyze their samples with a curated dataset specific to their organism.

I think it would also be nice if these curated datasets included SRA accessions linked to publications. But this exceeds my capabilities and would require extensive community support.

Species specific Workflows

With DSL2, we can create species-specific workflows by combining the main Bactopia workflow with some Bactopia Tools. The main example, which will act as a proof-of-concept, will be Staphopia. Staphopia is essentially Bactopia + the Bactopia Tool staph-typer.

  • Create a Staphopia workflow
@rpetit3 rpetit3 added the question Further information is requested label Aug 13, 2021
@rpetit3
Member Author

rpetit3 commented Aug 13, 2021

This is a Work-In-Progress, and evolving. Please feel free to provide any feedback and suggestions.

@rpetit3 rpetit3 pinned this issue Aug 13, 2021
@rpetit3 rpetit3 added enhancement New feature or request help wanted Extra attention is needed v2.0.0 Bactopia v2 and removed question Further information is requested labels Aug 13, 2021
@Mxrcon
Member

Mxrcon commented Aug 14, 2021

@rpetit3 I completely agree with the design pattern that you're aiming to implement. Creating specific workflows for some genera makes complete sense and could be easily implemented with DSL2. Let me know if I can make myself useful on those tasks!

@abhi18av
Member

abhi18av commented Aug 16, 2021

Regarding the inclusion of outlined modules into nf-core, @Mxrcon and I are happy to take this forward as discussed on slack.

I'm dividing these into two groups, for initiating a baseline draft module (we'll be collaborating extensively of course)

@abhi18av and @Mxrcon

P.S. @rpetit3 , if possible could you please add us as collaborators, so that we could start making use of https://github.com/bactopia/bactopia/projects/1 as well?

@kusandeep

Looking forward to Bactopia v2.
I am already excited to see those already listed tools in v2. Additionally, I am interested in #157, pangenome visualisation tools would be great too (#156), and Scoary (https://github.com/AdmiralenOla/Scoary) (#159).

@abhi18av
Member

@rpetit3 , I need a bit of guidance here regarding the contributions to nf-core modules.

  1. Which sub-commands exactly do we need from the tools mentioned here Bactopia v2 Overview #233 (comment) . This is necessary to understand since the structure of modules is TOOL/SUB_COMMAND
  2. The nf-core/modules generally rely on https://github.com/nf-core/test-datasets/ repo, which only has human and sars-cov datasets, we might need to decide which bacterial dataset we want to add to test these modules. I'm thinking E. coli by default, but happy to hear your thoughts.

@rpetit3
Member Author

rpetit3 commented Aug 20, 2021

1. Which sub-commands exactly do we need from the tools mentioned here Bactopia v2 Overview #233 (comment) . This is necessary to understand since the structure of modules is TOOL/SUB_COMMAND

I'll get this put together, but I think most of them don't have subcommands

2. The nf-core/modules generally rely on https://github.com/nf-core/test-datasets/ repo, which only has human and sars-cov datasets, we might need to decide which bacterial dataset we want to add to test these modules. I'm thinking E. coli by default, but happy to hear your thoughts.

I noticed that as well. To get around this I created https://github.com/bactopia/bactopia-tests which is modeled after nf-core's test-datasets, just customized for bactopia. For these tests I'm using Candidatus Portiera aleyrodidarum which only has a genome size of 350kb. Here's some more info https://github.com/bactopia/bactopia/tree/dsl2/tests I've also modeled the tests.config after nf-core/modules (glad I was there for that hackathon!).

The small size genome allows me to test functionality quickly, but I think more importantly with limited resources (my desktop, and eventually GitHub Actions).

Now, I think eventually another set of genomes (multiple organisms, E. coli being one) will be needed as a validation set. These would be used to make sure Bactopia is still giving the same results, and that larger genomes are not producing errors we wouldn't have seen using a small genome. I think this would be a pre-release test set, but again something we could ideally get to fit on GitHub Actions (not sure that'll be possible, unless the memory is increased to >7gb on the Linux instance).

@rpetit3
Member Author

rpetit3 commented Sep 2, 2021

Small update. I've been doing some cleanup and integrating more nf-core inspirations into v2. The cleanup is to make things easier to maintain, and the nf-core practices are there because I think they are great practices to follow, and including them now would make a potential nf-core transition for Bactopia easier.

save_files function for publishDir

I added a save_files function that is inspired by nf-core's saveFiles. I opted to change the name slightly to denote the difference. save_files moves files to the appropriate output locations. It:

  • lets you ignore certain files
  • lets you rename the logs subdirectory (e.g. for ariba with multiple datasets: logs/ariba_analysis/card, logs/ariba_analysis/vfdb)
  • places any *-error.txt files in the sample base directory
  • uses outputs.config to place outputs into a directory by PROCESS_NAME
  • renames the results/ directory to the PROCESS_NAME output
    • results/ is a generic name for process outputs that go to a directory
    • This is also adopted from some nf-core workflows

Here's an example of how this looks for qc_reads

    publishDir "${params.outdir}/${sample}",
        mode: params.publish_mode,
        overwrite: params.overwrite,
        saveAs: { filename -> save_files(filename:filename, process_name:PROCESS_NAME, ignore:[ '-genome-size.txt', extra]) }

    output:
    tuple val(sample), val(single_end), path("results/${sample}*.fastq.gz"), emit: fastq, optional: true
    tuple val(sample), val(sample_type), val(single_end), path("results/${sample}*.fastq.gz"), path(extra), path(genome_size), emit: fastq_assembly, optional: true
    path "results/*"
    path "*.std{out,err}.txt", emit: logs, optional: true
    path ".command.*", emit: nf_logs
    path "*.version.txt", emit: version
    path "*-error.txt", optional: true
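
The routing rules described above can be sketched in Python as follows. This is not the Groovy save_files implementation: the `route_output` name, the suffix-based ignore matching, and the exact layout under logs/ are assumptions made only to show the order of the decisions.

```python
def route_output(filename, process_name, ignore=(), logs_subdir=None):
    """Return the publish path relative to the sample directory, or None to skip."""
    # 1. Skip files the caller asked to ignore (matched by suffix here, an assumption).
    if any(filename.endswith(suffix) for suffix in ignore):
        return None
    # 2. Error reports stay at the top of the sample directory.
    if filename.endswith("-error.txt"):
        return filename
    # 3. Captured logs and Nextflow's .command.* files go under logs/,
    #    optionally inside a renamed subdirectory (e.g. one per dataset).
    is_log = filename.startswith(".command.") or filename.endswith(
        (".stdout.txt", ".stderr.txt")
    )
    if is_log:
        parts = ["logs", process_name] + ([logs_subdir] if logs_subdir else [])
        return "/".join(parts + [filename])
    # 4. The generic results/ directory is renamed after the process.
    if filename.startswith("results/"):
        return f"{process_name}/{filename[len('results/'):]}"
    return f"{process_name}/{filename}"
```

For example, `route_output(".command.log", "ariba_analysis", logs_subdir="card")` would land in `logs/ariba_analysis/card/`, mirroring the ariba example in the list above.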

Main Bactopia Process Cleanup

I've cleaned up the main Bactopia processes significantly.

  • removed check_staging from all processes
  • removed bash traps
  • Nextflow's files (.command.*) are captured as outputs
  • program versions are captured following nf-core's modules setup (e.g. <SOFTWARE_NAME>.version.txt)
  • Where possible I've removed duplicate code which was most common on single-end vs paired-end
  • Capture STDOUT and STDERR to <SOFTWARE>.std{err,out}.txt
  • Updated all tests to check for the new outputs

Process consolidation

While cleaning up, I consolidated a number of processes. In all, 11 processes were merged into other processes, taking the number of processes to maintain from 23 to 13. This was mostly done because these processes tended to be super quick (e.g. making a blast database) so this should reduce overall runtimes due to less overhead associated with starting and stopping processes. I think it also just makes things a bit easier to maintain.

  • blast_genes, blast_primers and blast_proteins are now a part of blast
  • qc_original_summary and qc_final_summary are now a part of qc_reads
  • download_references and call_variants_auto are now a part of call_variants
  • fastq_status and estimate_genome_size are now a part of gather_samples
  • make_blastdb is now a part of assemble_genome
  • count_31mers is now a part of minmer_sketch
    • Counting 31mers is optional and activated by --count_31mers

So the main workflow now looks like this:
[Figure: Bactopia Workflows diagram]

At this point, I think I'm all set to start making subworkflows.

@rpetit3
Member Author

rpetit3 commented Sep 4, 2021

Small update, I took the lib folder from nf-core's tools pipeline template, and hacked it to work for Bactopia.

In doing so, this provides:

  • arg parsing and validation
  • usage printing
  • framework for bactopia tools
  • sharable schemas between workflows

It also keeps Bactopia on track for maintaining convergent evolution with nf-core practices, which would make any potential transitions easier.

nextflow run main.nf --help
N E X T F L O W  ~  version 21.04.0
Launching `main.nf` [marvelous_mirzakhani] - revision: 555ddf8329


------------------------------------------------------
   _                _              _
  | |__   __ _  ___| |_ ___  _ __ (_) __ _
  | '_ \ / _` |/ __| __/ _ \| '_ \| |/ _` |
  | |_) | (_| | (__| || (_) | |_) | | (_| |
  |_.__/ \__,_|\___|\__\___/| .__/|_|\__,_|
                            |_|
  bactopia v1.7.1
------------------------------------------------------
Typical pipeline command:

  bactopia --fastqs samples.txt --datasets datasets/ --species 'Staphylococcus aureus' -profile singularity

Required Parameters
  ### For Procesessing Multiple Samples
  --fastqs                     [string]  A FOFN with sample names and paths to FASTQ/FASTAs to process

  ### For Processing A Single Sample
  --R1                         [string]  First set of compressed (gzip) paired-end FASTQ reads (requires --R2 and --sample)
  --R2                         [string]  Second set of compressed (gzip) paired-end FASTQ reads (requires --R1 and --sample)
  --SE                         [string]  Compressed (gzip) single-end FASTQ reads  (requires --sample)
  --hybrid                     [boolean] Treat `--SE` as long reads for hybrid assembly.  (requires --R1, --R2, --SE and --sample)
  --sample                     [string]  Sample name to use for the input sequences

  ### For Downloading from SRA/ENA or NCBI Assembly
  **Note: Downloaded assemblies will have error free Illumina reads simulated for processing.**
  --accessions                 [string]  A file containing ENA/SRA Experiment accessions or NCBI Assembly accessions to processed
  --accession                  [string]  Sample name to use for the input sequences

  ### For Processing an Assembly
  **Note: Assemblies will have error free Illumina reads simulated for processing.**
  --assembly                   [string]  A assembled genome in compressed FASTA format. (requires --sample)

Dataset Parameters
  --datasets                   [string]  The path to datasets that have already been set up
  --species                    [string]  Name of species for species-specific dataset to use

Optional Parameters
  --coverage                   [integer] Reduce samples to a given coverage [default: 100]
  --genome_size                [string]  Expected genome size (bp) for all samples, a value of '0' will disable read error correction and read subsampling,
                                         otherwise estimate with Mash [default: 0]
  --outdir                     [string]  Base directory to write results and Nextflow related outputs to [default: ./]
  --run_name                   [string]  Name of the directory to hold results (e.g. ${params.outdir}/${params.run_name}/<SAMPLE_NAME> [default:
                                         bactopia]

Helpful Parameters
  --available_datasets         [boolean] Print a list of available datasets found based on location given by `--datasets`
  --example_fastqs             [boolean] Print an example FOFN expected from `--fastqs`
  --help_all                   [boolean] An alias for --help --show_hidden_params
  --version                    [boolean] Display version text.

!! Hiding 157 params, use --show_hidden_params (or --help_all) to show them !!
------------------------------------------------------
If you use bactopia for your analysis please cite:

* Bactopia
  https://doi.org/10.1128/mSystems.00190-20

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/bactopia/bactopia/blob/master/CITATIONS.md
------------------------------------------------------

Bactopia related additions

  1. Utils.groovy changes
    1. fileExists() - return true if file exists else false
    2. fileNotFound() - return 1 if file does not exist else 0
    3. fileNotGzipped() return 1 if a file is not GZipped else 0
  2. NfcoreSchema.groovy changes
    1. Added header field to print the things like: ### For Procesessing Multiple Samples
    2. Added paramsRequired to only print the required_parameters section
    3. Added --help_all to act as an alias for --help --show_hidden_params
    4. If parameter has choices (e.g. --assembler) print available choices (will follow up with nf-core folks to see if they want a PR)
# nf-core output
* --assembler: sksks is not a valid enum value (sksks)

# bactopia output
* --assembler: 'sksks' is not a valid choice (Available choices: skesa, megahit, spades, velvet)
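
The choices check can be sketched in a few lines of Python. This is an illustration only, not the NfcoreSchema.groovy change itself; the `validate_choice` name is hypothetical, and the message format is copied from the bactopia output shown above.

```python
def validate_choice(name, value, choices):
    """Return None when value is allowed, otherwise a message listing the valid choices."""
    if value in choices:
        return None
    available = ", ".join(choices)
    return f"* --{name}: '{value}' is not a valid choice (Available choices: {available})"


print(validate_choice("assembler", "sksks", ["skesa", "megahit", "spades", "velvet"]))
```

Listing the available choices in the error saves the user a trip to the help text.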

More Examples

nextflow run main.nf --coverage not_an_integer
N E X T F L O W  ~  version 21.04.0
Launching `main.nf` [nice_pare] - revision: 555ddf8329

ERROR: Validation of pipeline parameters failed!


* --coverage: expected type: Number, found: String (not_an_integer)


nextflow run main.nf --SE bactopia
... TRUNCATED ...
------------------------------------------------------
One or more required parameters are missing, please check and try again.


Required Parameters
  ### For Procesessing Multiple Samples
  --fastqs                     [string]  A FOFN with sample names and paths to FASTQ/FASTAs to process

  ### For Processing A Single Sample
  --R1                         [string]  First set of compressed (gzip) paired-end FASTQ reads (requires --R2 and --sample)
  --R2                         [string]  Second set of compressed (gzip) paired-end FASTQ reads (requires --R1 and --sample)
  --SE                         [string]  Compressed (gzip) single-end FASTQ reads  (requires --sample)
  --hybrid                     [boolean] Treat `--SE` as long reads for hybrid assembly.  (requires --R1, --R2, --SE and --sample)
  --sample                     [string]  Sample name to use for the input sequences

  ### For Downloading from SRA/ENA or NCBI Assembly
  **Note: Downloaded assemblies will have error free Illumina reads simulated for processing.**
  --accessions                 [string]  A file containing ENA/SRA Experiment accessions or NCBI Assembly accessions to processed
  --accession                  [string]  Sample name to use for the input sequences

  ### For Processing an Assembly
  **Note: Assemblies will have error free Illumina reads simulated for processing.**
  --assembly                   [string]  A assembled genome in compressed FASTA format. (requires --sample)

!! Hiding 1 params, use --show_hidden_params (or --help_all) to show them !!
------------------------------------------------------
Output directory (.//bactopia) exists, Bactopia will not continue unless '--force' is used.


ERROR: Validation of pipeline parameters failed!
Please correct to continue

nextflow run main.nf --SE bactopia --sample lll
... TRUNCATED ...
------------------------------------------------------
* --SE: Please verify "bactopia" is compressed using GZIP


Output directory (.//bactopia) exists, Bactopia will not continue unless '--force' is used.


ERROR: Validation of pipeline parameters failed!
Please correct to continue
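
For reference, the three Utils.groovy helpers listed earlier could look roughly like this in Python. The function names are Python spellings of the Groovy ones, and checking the gzip magic bytes is an assumption about how "not GZipped" might be detected, not a description of the actual implementation.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def file_exists(path):
    """Return True if path is an existing file, else False."""
    return Path(path).is_file()


def file_not_found(path):
    """Return 1 if the file does not exist, else 0."""
    return 0 if file_exists(path) else 1


def file_not_gzipped(path):
    """Return 1 if the file is missing or does not start with the gzip magic bytes, else 0."""
    path = Path(path)
    if not path.is_file():
        return 1
    with path.open("rb") as handle:
        return 0 if handle.read(2) == GZIP_MAGIC else 1
```

A check like this would flag the `--SE bactopia` case above, since `bactopia` is not a gzip-compressed file.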

I think at this point I'm going to get a basic workflow without any datasets, then work on cleaning up the Dataset imports and validations

Also, Bactopia has a lot of parameters! !! Hiding 157 params, use --show_hidden_params (or --help_all) to show them !!

@rpetit3
Member Author

rpetit3 commented Sep 11, 2021

V2 is getting super close! I've linked the datasets into the DSL2 framework (also fixed an issue with the cache when using -resume (#212)).

[f9/8fb0ff] process > BACTOPIA:GATHER_SAMPLES (SRX4563634)                      [100%] 1 of 1, cached: 1 ✔
[18/57e2ee] process > BACTOPIA:QC_READS (SRX4563634)                            [100%] 1 of 1, cached: 1 ✔
[aa/2f7e80] process > BACTOPIA:ASSEMBLE_GENOME (SRX4563634)                     [100%] 1 of 1, cached: 1 ✔
[45/67a199] process > BACTOPIA:ASSEMBLY_QC (SRX4563634 - quast)                 [100%] 2 of 2, cached: 1 ✔
[dd/e48402] process > BACTOPIA:ANNOTATE_GENOME (SRX4563634)                     [100%] 1 of 1 ✔
[c5/85af3b] process > BACTOPIA:MINMER_SKETCH (SRX4563634)                       [100%] 1 of 1, cached: 1 ✔
[54/c82daa] process > BACTOPIA:ANTIMICROBIAL_RESISTANCE (SRX4563634)            [100%] 1 of 1 ✔
[83/bb0029] process > BACTOPIA:ARIBA_ANALYSIS (SRX4563634 - vfdb_core)          [100%] 2 of 2, cached: 2 ✔
[f9/e25709] process > BACTOPIA:MINMER_QUERY (SRX4563634 - sourmash-genbank-k31) [100%] 4 of 4, cached: 4 ✔
[2f/de1413] process > BACTOPIA:BLAST (SRX4563634 - blastn)                      [100%] 3 of 3 ✔
[27/a09458] process > BACTOPIA:CALL_VARIANTS (SRX4563634 - GCF_000009645)       [100%] 3 of 3, cached: 3 ✔
[99/750675] process > BACTOPIA:MAPPING_QUERY (SRX4563634)                       [100%] 1 of 1, cached: 1 ✔
[34/177ff6] process > BACTOPIA:SEQUENCE_TYPE (SRX4563634 - default)             [100%] 1 of 1 ✔

    Bactopia Execution Summary
    ---------------------------
    Command Line    : nextflow run ../main.nf --accession SRX4563634 --datasets /home/robert_petit/shannon-wgs/datasets/ --species 'Staphylococcus aureus' -profile singularity --genome_size median -resume
    Resumed         : true
    Completed At    : 2021-09-11T05:20:47.912022Z
    Duration        : 4m 10s
    Success         : true
    Exit Code       : 0
    Error Report    : -
    Launch Dir      : /home/robert_petit/bactopia/temp

Completed at: 11-Sep-2021 05:20:48
Duration    : 4m 10s
CPU hours   : 2.0 (71.6% cached)
Succeeded   : 7
Cached      : 15

I realize a lot of the code is hidden in the modules, but this is the Bactopia Workflow in its entirety. It's so much cleaner, and much easier to work with now.

workflow BACTOPIA {
    print_efficiency(RESOURCES.MAX_CPUS) 
    datasets = setup_datasets()

    // Core steps
    GATHER_SAMPLES(create_input_channel(run_type, datasets['genome_size']))
    QC_READS(GATHER_SAMPLES.out.raw_fastq)
    ASSEMBLE_GENOME(QC_READS.out.fastq_assembly)
    ASSEMBLY_QC(ASSEMBLE_GENOME.out.fna, Channel.fromList(['checkm', 'quast']))
    ANNOTATE_GENOME(ASSEMBLE_GENOME.out.fna, Channel.fromPath(datasets['proteins']), Channel.fromPath(datasets['training_set']))
    MINMER_SKETCH(QC_READS.out.fastq)

    // Optional steps that require datasets
    // Species agnostic
    ANTIMICROBIAL_RESISTANCE(ANNOTATE_GENOME.out.annotations, datasets['amr'])
    ARIBA_ANALYSIS(QC_READS.out.fastq, datasets['ariba'])
    MINMER_QUERY(MINMER_SKETCH.out.sketch, datasets['minmer'])

    // Species Specific
    BLAST(ASSEMBLE_GENOME.out.blastdb, datasets['blast'])
    CALL_VARIANTS(QC_READS.out.fastq, datasets['references'])
    MAPPING_QUERY(QC_READS.out.fastq, datasets['mapping'])
    SEQUENCE_TYPE(ASSEMBLE_GENOME.out.fna_fastq, datasets['mlst'])
}
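
To make the data flow in the workflow above easier to see, here is a small Python sketch that writes down which process consumes which upstream output and derives one valid execution order. The dependency table is read off the workflow block; the code is only an illustration and says nothing about how Nextflow actually schedules the processes.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# process -> upstream processes whose outputs it consumes
UPSTREAM = {
    "GATHER_SAMPLES": set(),
    "QC_READS": {"GATHER_SAMPLES"},
    "ASSEMBLE_GENOME": {"QC_READS"},
    "ASSEMBLY_QC": {"ASSEMBLE_GENOME"},
    "ANNOTATE_GENOME": {"ASSEMBLE_GENOME"},
    "MINMER_SKETCH": {"QC_READS"},
    "ANTIMICROBIAL_RESISTANCE": {"ANNOTATE_GENOME"},
    "ARIBA_ANALYSIS": {"QC_READS"},
    "MINMER_QUERY": {"MINMER_SKETCH"},
    "BLAST": {"ASSEMBLE_GENOME"},
    "CALL_VARIANTS": {"QC_READS"},
    "MAPPING_QUERY": {"QC_READS"},
    "SEQUENCE_TYPE": {"ASSEMBLE_GENOME"},
}

# static_order() yields every process after all of its upstream processes.
order = list(TopologicalSorter(UPSTREAM).static_order())
print(" -> ".join(order))
```

Everything hangs off GATHER_SAMPLES and QC_READS, which is why those two always come first in any valid order.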

Next I will implement a species-specific workflow (e.g. Staphopia). Then I think we'll be at the point where we can tidy up and start prepping for the V2 release.

@rpetit3
Member Author

rpetit3 commented Sep 23, 2021

Small update - I've incorporated nf-core/modules usage of the meta variable to store general info (e.g. id, single_end, etc...). I've always wanted to get away from transporting multiple val inputs/outputs between processes and this does that.

I updated the tests for this change, and I've also implemented this in GitHub Actions. So we can now quickly run all the tests that we're making.
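
The idea behind the meta variable can be shown with a short Python sketch: one dictionary carries the per-sample facts, so a step passes along a single (meta, files) pair instead of several separate values. The `id` and `single_end` keys come from the comment above; the helper names and file names are hypothetical.

```python
def with_meta(sample, reads):
    """Bundle per-sample facts into one dict that travels with the files."""
    meta = {"id": sample, "single_end": len(reads) == 1}
    return meta, reads


def describe(item):
    # A downstream step unpacks one (meta, reads) pair instead of several values.
    meta, reads = item
    layout = "single-end" if meta["single_end"] else "paired-end"
    return f"{meta['id']}: {layout}, {len(reads)} file(s)"


print(describe(with_meta("SRX4563634", ["SRX4563634_R1.fastq.gz", "SRX4563634_R2.fastq.gz"])))
```

Adding a new per-sample fact then means adding a key to the dict rather than changing the inputs and outputs of every process.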

@rpetit3
Member Author

rpetit3 commented Sep 27, 2021

nf-core/modules is going through some pretty exciting times at the moment related to a new change in version capturing (nf-core/modules#665). These changes are likely to delay some submissions to nf-core/modules.

I'm also finding some cases where the SARS-CoV-2 test data doesn't work well with bacterial tools. To address this I've submitted a PR (nf-core/test-datasets#344) to add some basic bacterial test-data to the modules branch on nf-core/test-datasets.

@rpetit3
Member Author

rpetit3 commented Oct 1, 2021

Small update: 11 of 18 bactopia tools have either been merged into nf-core/modules or have an open PR. I think that's pretty awesome! Thank you for the help @abhi18av!

I'm hoping to get a few more PRs open tomorrow.

@rpetit3
Member Author

rpetit3 commented Oct 6, 2021

Significant update - Bactopia now officially supports Nanopore reads!

With 8f97b78 and 26207c8

Nanopore support

I have added QC and assembly support for Nanopore reads.

QC is done:

And assembly is done with Dragonflye.

Adopting versions.yml

To make using modules from nf-core/modules easier I have adapted their versions.yml into Bactopia. So for each process a versions.yml will be created. Here is an example from assemble_genome:

assemble_genome:
    any2fasta:  0.4.2
    assembly-scan: 0.4.1
    bwa: 0.7.17-r1188
    flash: 1.2.11
    flye: 2.9-b1768
    makeblastdb: 2.12.0+
    medaka: 1.2.2
    megahit: 1.2.9
    miniasm: 0.3-r179
    minimap2: 2.22-r1101
    nanoq: 0.8.2
    pigz: 2.6
    pilon: 1.24
    racon: 1.4.20
    rasusa: 0.6.0
    raven: 1.6.1
    samclip: 0.4.0
    samtools: 1.11
    shovill: 1.1.0
    shovill-se: 1.1.0
    skesa: 2.4.0
    spades.py: 3.15.3
    velvetg: 1.2.10
    velveth: 1.2.10
    unicycler: 0.4.8

I opted to output the versions of all programs in a process even if the specific program wasn't used (e.g. SKESA, SPAdes, Flye, etc...) on a given run, just because I find it easier to have it all there and take what's needed.

More importantly this will allow us to use nf-core/modules dumpsoftwareversions to merge all the versions.yml
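
Merging per-process versions.yml files boils down to combining nested mappings. The Python sketch below shows that idea with a tiny hand-written parser for the two-level layout shown above. It is not how nf-core's dumpsoftwareversions works internally, and the `example_process` entry is made up for illustration.

```python
def parse_versions(text):
    """Parse the two-level 'process:' / '    tool: version' layout into nested dicts."""
    versions, current = {}, None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        key, _, value = raw.strip().partition(":")
        if raw[0].isspace():  # indented line: a tool under the current process
            versions[current][key] = value.strip()
        else:  # unindented line: start of a new process section
            current = key
            versions[current] = {}
    return versions


def merge_versions(documents):
    """Combine several parsed versions.yml documents, keyed by process name."""
    merged = {}
    for doc in documents:
        for process, tools in doc.items():
            merged.setdefault(process, {}).update(tools)
    return merged


a = parse_versions("assemble_genome:\n    shovill: 1.1.0\n    skesa: 2.4.0\n")
b = parse_versions("example_process:\n    pigz: 2.6\n")  # made-up second process
print(merge_versions([a, b]))
```

A real implementation would use a YAML library instead of the hand-written parser; it is only here to keep the sketch free of third-party dependencies.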

nf-core/modules Submissions

We've now submitted 13 modules to nf-core/modules, that's pretty exciting!

I will now start recreating some of the bactopia tools as subworkflows.

@rpetit3
Member Author

rpetit3 commented Oct 14, 2021

I could use your opinion. In Bactopia v2, you will be able to create custom workflows. For example:

  • Staphopia = Bactopia + Staphtyper (AgrVATE, spatyper, staphopiasccmec)
  • Subset only (assembly only, variant call only, etc...)
  • species specific tools (ectyper, kleborate, hicap, etc...)

You can basically include any available bactopia tools. However, I think some should be excluded and only available independently (e.g. GTDB, eggnog, pirate, roary, etc...), due to runtimes or input DB size requirements.

I'm curious how you would like to do something like this. Some ideas I have are:

Command-line

bactopia ...opts... --staphopia
bactopia ...opts... --tools ectyper,roary,summary
bactopia ...opts... --assembly_only
bactopia ...opts... --add_staphtyper --add_summary

Automated for species (can be disabled)

bactopia ...opts... --datasets /path/to/datasets --species "Staphylococcus aureus" ---> automatically runs StaphTyper
bactopia ...opts... --datasets /path/to/datasets --species "Staphylococcus aureus" --skip_tools ---> does not automatically run StaphTyper
bactopia ...opts... --datasets /path/to/datasets --species "Escherichia coli" ---> automatically runs ectyper
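
The species-driven option amounts to a lookup with an opt-out. A minimal Python sketch, assuming a hypothetical `SPECIES_TOOLS` table that holds only the two pairings mentioned above:

```python
# Hypothetical lookup; only these two pairings are mentioned in this thread.
SPECIES_TOOLS = {
    "staphylococcus aureus": ["staphtyper"],
    "escherichia coli": ["ectyper"],
}


def tools_for_species(species, skip_tools=False):
    """Return the tools to run automatically for a species, or none when skipped."""
    if skip_tools:
        return []
    return list(SPECIES_TOOLS.get(species.strip().lower(), []))


print(tools_for_species("Staphylococcus aureus"))
print(tools_for_species("Staphylococcus aureus", skip_tools=True))
```

A species without an entry simply gets no extra tools, so the main workflow would behave as it does today.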

Config file

bactopia ..opts... --custom_wf my_wf.config

my_wf.config:
bactopia
staphtyper
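
Reading such a config could be as simple as the Python sketch below. The one-name-per-line format follows the example above, while the `read_workflow_config` name and the rejection of standalone-only tools are assumptions about how the exclusions mentioned earlier might be enforced.

```python
# Tools suggested in this thread as standalone-only (runtime or database size).
STANDALONE_ONLY = {"gtdb", "eggnog", "pirate", "roary"}


def read_workflow_config(text):
    """Split a one-name-per-line config into steps to include and steps to reject."""
    names = [line.strip().lower() for line in text.splitlines() if line.strip()]
    include = [name for name in names if name not in STANDALONE_ONLY]
    rejected = [name for name in names if name in STANDALONE_ONLY]
    return include, rejected


print(read_workflow_config("bactopia\nstaphtyper\nroary\n"))
```

Reporting the rejected names, rather than silently dropping them, would let the user know which tools have to be run on their own.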

Command to generate static workflows

Each of the above examples is dynamic and has many moving parts that can cause issues. An alternative would be to create a bactopia command (bactopia generate_wf) to generate a custom Nextflow workflow script for you. This would make the workflow static and allow you to run it directly without all the extra configs (command line parameters, config files, etc...)

I'm open to pretty much anything, so please feel free to toss out ideas.

Tagging a few folks who have given feedback in the past (please feel free to unsubscribe to notifications!) @embatty @lskatz @kusandeep @simone-pignotti @tauqeer9 @uloeber @marcelladane @haruosuz

@Mxrcon
Member

Mxrcon commented Oct 15, 2021

The command line seems perfect for new users and seems to be a nice path to follow in the development, but I think it might be a problem to maintain, as every new tool will require a new parameter. A tool for generating new workflows seems awesome, but for specific customization it'll require more experienced users. I'll certainly use it to develop custom workflows.

You think that automated use of the bactopia tools will be a standard on the V2? This seems to be a good addition as the v2 is moving in the direction of species specific workflows.

@rpetit3
Member Author

rpetit3 commented Oct 15, 2021

Thank you for the feedback @Mxrcon!

You think that automated use of the bactopia tools will be a standard on the V2? This seems to be a good addition as the v2 is moving in the direction of species specific workflows.

I think yes, but we'll have to strike a balance. There are some tools (e.g. AgrVATE, spaTyper) that are super quick and computationally cheap. For these types of tools I say include them!

But if certain species-specific tools have significant runtime or resource requirements, then I think those should be executed separately.

@simone-pignotti

First of all, congrats on all the awesome improvements!
My personal favorite is the use of config files, and I agree with @Mxrcon that although command line params are nice for new users without nextflow expertise, they will be tough to maintain and add up to the 157 params already hidden from the help message :) config files sound like a nice balance between usability and flexibility.
This being said, the bactopia generate_wf idea is also very cool, I would definitely use it for more complex pipelines where I want to add a tool that is not available in bactopia nor nf-core modules (hence solving #222). I think this is pretty important, because no matter how comprehensive and up-to-date the list of tools supported by bactopia is, advanced users will always feel the need to add other ones.
Of course we don't necessarily need bactopia generate_wf to achieve this, we could write the workflow ourselves or use some template with bactopia's workflow already outlined, but it still sounds like a useful feature to me (if it doesn't turn out to be too complex to implement or maintain).

Concerning species-specific workflows that are automatically generated by the species param, it's certainly nice to have but I personally wouldn't mind adding the specific tools I need in a config file.

There are some tools (e.g. AgrVATE, spaTyper) that are super quick and computationally cheap. For these types of tools I say include them!

For sure I would prefer anything that takes more than a few minutes per sample to NOT be automatically added to the workflow!

@rpetit3
Member Author

rpetit3 commented Oct 15, 2021

@simone-pignotti thanks as always for the feedback.

I think I'm starting to be convinced on the idea of custom workflows generated by config files, because it kind of creates a pathway for a curated set of user provided workflow configs.

For the power users, I'm thinking in the short term I'll create documentation for the inputs and outputs of each module, as well as make some examples of adding 'custom' modules. Most advanced Nextflow users could probably use this as a baseline and get started pretty quickly hacking away.

@simone-pignotti

it kind of creates a pathway for a curated set of user provided workflow configs.

I hadn't thought of it this way, but this may actually be the best reason to opt for the config file solution!

For the custom pipelines, is there an easy way to just add bactopia as a process inside a more complex workflow?

@abhi18av
Member

abhi18av commented Oct 18, 2021

For the custom pipelines, is there an easy way to just add bactopia as a process inside a more complex workflow?

That's a great question @simone-pignotti. I think that since Bactopia V2 is based on Nextflow DSL2, it should be doable. Though given the number of configs we use, it might be worth testing this officially and then having a showcase extension workflow to guide the community a bit.

Nested params (e.g. params.foo.bar.baz) are still somewhat problematic with Nextflow.

@rpetit3
Member Author

rpetit3 commented Oct 18, 2021

@simone-pignotti totally agree with @abhi18av's points.

I think for the initial v2 release we might not be quite ready for direct import of bactopia modules (e.g. assemble_genome) into your own workflows. However, I have been thinking about ways to disentangle the modules and configs to make them portable, similar to nf-core/modules.

@rpetit3
Member Author

rpetit3 commented Oct 22, 2021

Progress Update - I have a rough framework for implementing bactopia-tools into the main bactopia workflow (e.g. custom workflows).

One thing I want to make sure of is that, whether the program is executing as a Bactopia Tool or as a part of the main bactopia workflow, the subworkflow script would be the same (e.g. we don't need to have separate scripts for each).

Here's an overview

Current Framework

workflows.config

params {
    workflows {
        'bactopia' {
            config = "conf/params/bactopia.config"
            includes = ["bactopia"]
            is_subworkflow = false
            path = "workflows/bactopia.nf"
            schema = "conf/schema/bactopia.json"
        }
        'staphopia' {
            includes = ["bactopia", "staphtyper"]
            is_subworkflow = false
            path = "workflows/staphopia.nf"
        }
        'staphtyper' {
            config = "subworkflows/local/staphtyper/params.config"
            ext = "fna"
            is_subworkflow = true
            path = "subworkflows/local/bactopia-tools/staphtyper/main.nf"
            schema = "subworkflows/local/staphtyper/params.json"
        }
    }
}

In the above there are a few things to point out

  • is_subworkflow = true means it can be run as a Bactopia Tool (I chose 'subworkflow' to be consistent with DSL2 terms)
  • is_subworkflow = false means it is a custom Bactopia workflow with additional steps (or the original Bactopia workflow)
  • includes = ["bactopia", "staphtyper"] tells the workflows which config files (e.g. params and schema) to programmatically load

So in this case the staphopia workflow includes bactopia and staphtyper, and the parameters from each of those workflows are loaded into staphopia.
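
To make the includes lookup concrete, here is a minimal Python sketch of how the files to load could be resolved. It is only a model of the workflows.config shown above, not Bactopia's actual Nextflow code: the dictionary copies the entries from the config, while the function name files_to_load and the rule that an entry without includes loads itself are assumptions made for illustration.

```python
# Hypothetical Python model of the workflows.config entries shown above.
WORKFLOWS = {
    "bactopia": {
        "config": "conf/params/bactopia.config",
        "includes": ["bactopia"],
        "is_subworkflow": False,
        "path": "workflows/bactopia.nf",
        "schema": "conf/schema/bactopia.json",
    },
    "staphopia": {
        "includes": ["bactopia", "staphtyper"],
        "is_subworkflow": False,
        "path": "workflows/staphopia.nf",
    },
    "staphtyper": {
        "config": "subworkflows/local/staphtyper/params.config",
        "ext": "fna",
        "is_subworkflow": True,
        "path": "subworkflows/local/bactopia-tools/staphtyper/main.nf",
        "schema": "subworkflows/local/staphtyper/params.json",
    },
}


def files_to_load(wf, workflows=WORKFLOWS):
    """Collect the params config and schema files a workflow would pull in."""
    entry = workflows[wf]
    # Assumption: an entry without 'includes' (e.g. staphtyper) loads itself.
    names = entry.get("includes", [wf])
    configs, schemas = [], []
    for name in names:
        included = workflows[name]
        if "config" in included:
            configs.append(included["config"])
        if "schema" in included:
            schemas.append(included["schema"])
    return {"configs": configs, "schemas": schemas}


print(files_to_load("staphopia"))
```

Under these assumptions, staphopia needs no config or schema of its own: it inherits both files from bactopia and both from staphtyper.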

IMPORTANT The staphopia workflow script (staphopia.nf) was already built. To my knowledge, there is currently no way in DSL2 to programmatically include modules within scripts. Something like this is not possible:

for module in params.workflows[params.wf].includes:
   include params.workflows[module].name from params.workflows[module].path

Instead you must hard-code the module includes

include { STAPHTYPER } from '../subworkflows/local/staphtyper/main'

What this means is custom workflows will have to have pre-built Nextflow scripts. But I think with proper documentation and plenty of examples, this shouldn't be an issue for some users.

Added --wf parameter

The magic workflow selector parameter is --wf (short for workflow). It defaults to --wf bactopia, but allows any workflow from workflows.config to be selected.

Going this route drops the need for a separate bactopia tools script to generate the commands for each Bactopia Tool. Now it's all handled by Nextflow; in other words, the bactopia-tools.py script can be dropped.

I will still create an alias for bactopia tools to help users transition from V1 to V2. So something like bactopia tools staphtyper will just be an alias for bactopia --wf staphtyper.
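
As a rough illustration of that alias, the argument rewrite could look like the Python sketch below. The function name translate_legacy_args is hypothetical and this is not the actual Bactopia wrapper; it only shows the mapping from the v1 form (tools NAME) to the v2 form (--wf NAME), passing any remaining arguments through untouched.

```python
def translate_legacy_args(argv):
    """Rewrite v1-style 'tools <name> ...' arguments into '--wf <name> ...'.

    argv is the argument list that follows the 'bactopia' command itself.
    Anything that is not a 'tools' call is returned as an unchanged copy.
    """
    if len(argv) >= 2 and argv[0] == "tools":
        return ["--wf", argv[1]] + list(argv[2:])
    return list(argv)


# 'bactopia tools staphtyper' becomes 'bactopia --wf staphtyper'
print(translate_legacy_args(["tools", "staphtyper"]))
```

Keeping the translation as a pure function on the argument list makes it easy to test without launching Nextflow.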

Other Changes

Push for container usage when possible

I'm a huge supporter of using Conda, but I think when users can use containers they should. To better indicate this, I've added the following message when Conda is used:

--------------------------------------------------------------------
WARN: Conda Disclaimer
WARN: If you have access to Docker or Singularity, please consider
WARN: running Bactopia using containers. The containers are less
WARN: susceptible to Conda environment related issues (e.g. version
WARN: conflicts).
WARN:
WARN: To use containers, you can use the profile parameter
WARN:     Docker: -profile docker
WARN:     Singularity: -profile singularity
--------------------------------------------------------------------

I've had users who weren't aware of -profile docker|singularity and, once aware, immediately switched to using them.

Suggest -qs usage

Nextflow likes to use all the resources, and this can be problematic on shared systems. So I added the following message:

Each task will use 4 CPUs out of the available 32 CPUs. At most
8 task(s) will be run at a time, this can affect the efficiency
of Bactopia. You can use the '-qs' parameter to alter the number of
tasks to run at a time (e.g. '-qs 2', means only 2 tasks or a maximum
of 8 CPUs will be used at once)

I think this will make it more apparent to users on a shared system that they can use the -qs parameter to play nicely with other users.
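
The numbers in that message follow from simple arithmetic, sketched below in Python. The function names are made up for illustration and the model is an assumption: it treats CPUs as the only limit on a single machine, which may not match how Bactopia actually computes the message.

```python
def max_parallel_tasks(available_cpus, cpus_per_task, queue_size=None):
    """Upper bound on simultaneous tasks when CPUs are the only limit.

    queue_size models the value passed to '-qs'; None means '-qs' was not given.
    """
    by_cpu = max(1, available_cpus // cpus_per_task)
    if queue_size is None:
        return by_cpu
    return min(by_cpu, queue_size)


def max_cpus_in_use(available_cpus, cpus_per_task, queue_size=None):
    """Most CPUs that can be busy at once under the same assumptions."""
    tasks = max_parallel_tasks(available_cpus, cpus_per_task, queue_size)
    return tasks * cpus_per_task


# Values from the message: 4 CPUs per task, 32 CPUs available.
print(max_parallel_tasks(32, 4))                # tasks at once without -qs
print(max_cpus_in_use(32, 4, queue_size=2))     # CPUs in use with '-qs 2'
```

With the values from the message, 32 // 4 gives the 8 simultaneous tasks, and '-qs 2' caps usage at 2 tasks x 4 CPUs = 8 CPUs.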

Logging Program versions

I adapted the nf-core implementation to work with Bactopia, so now each run you will get a YAML file like this:

Workflow:
  Nextflow: 21.04.0
  bactopia: 1.7.1
annotate_genome:
  prokka: 1.14.6
... TRUNCATED ...
qc_reads:
  bbduk: '38.90'
  fastq-scan: 0.4.1
  fastqc: 0.11.9
  lighter: 1.1.2
  nanoplot: 1.38.1
  nanoq: 0.8.2
  pigz: '2.6'
  porechop: 0.2.4
  rasusa: 0.6.0
sequence_type:
  ariba: 2.14.6
  blastn: 2.11.0+
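
For readers curious how such a file can be assembled, below is a simplified Python sketch. It is not the nf-core or Bactopia implementation: the input format (a list of process name and tool/version pairs) and all function names are assumptions. It also shows one plausible reason some versions in the file above are quoted: a version such as 38.90 would otherwise be read back as a number and lose its trailing zero.

```python
def merge_versions(records):
    """Merge (process_name, {tool: version}) pairs into one nested dict."""
    merged = {}
    for process, tools in records:
        merged.setdefault(process, {}).update(tools)
    return merged


def looks_numeric(value):
    """True if a YAML reader would likely treat the string as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def to_yaml(merged):
    """Emit a minimal two-level YAML document, quoting number-like versions."""
    lines = []
    for process in sorted(merged):
        lines.append(f"{process}:")
        for tool in sorted(merged[process]):
            version = merged[process][tool]
            text = f"'{version}'" if looks_numeric(version) else version
            lines.append(f"  {tool}: {text}")
    return "\n".join(lines)


records = [
    ("qc_reads", {"bbduk": "38.90"}),
    ("qc_reads", {"pigz": "2.6", "fastqc": "0.11.9"}),
    ("sequence_type", {"blastn": "2.11.0+"}),
]
print(to_yaml(merge_versions(records)))
```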

With this framework in place, I'll start the process of getting all the bactopia tools back in place. I think we are pretty close!

@rpetit3
Member Author

rpetit3 commented Nov 2, 2021

I think we are getting super close to being ready for the v2 release.

I was able to submit a few more PRs to nf-core/modules (ectyper, tbprofiler, clonalframeml, fastq-scan, ncbi-genome-download). Instead of waiting on their review, I went ahead and implemented them into Bactopia.

I've converted almost all of the v1 Bactopia Tools to DSL2. It's a very manual process at the moment, but it's not too tedious and is easy to work with. I am sorting the Bactopia Tools into two groups:

  1. Subworkflows - pieces multiple tools together into a single workflow (e.g. pangenome analysis, staphtyper)
  2. Modules - runs a single tool (e.g. ectyper) on multiple samples and merges the results.

There is a parameter --list_wfs that will print available workflows. This is also all handled by Nextflow and config files.

nextflow run ../main.nf --list_wfs
N E X T F L O W  ~  version 21.04.0
Launching `../main.nf` [infallible_mcclintock] - revision: 49a1306a48


---------------------------------------------
   _                _              _
  | |__   __ _  ___| |_ ___  _ __ (_) __ _
  | '_ \ / _` |/ __| __/ _ \| '_ \| |/ _` |
  | |_) | (_| | (__| || (_) | |_) | | (_| |
  |_.__/ \__,_|\___|\__\___/| .__/|_|\__,_|
                            |_|
  bactopia v1.7.1
---------------------------------------------
Below are a list of workflows you can call using the --wf parameter.

Bactopia
  bactopia (default)       Bactopia is a flexible pipeline for complete analysis of bacterial genomes.
  staphopia                Staphopia is a flexible pipeline for complete analysis of Staphylococcus aureus genomes.

Bactopia Tools
Bactopia Tools can include multiple tools (Subworkflows) or a single tool (Modules).

Subworkflows
  staphtyper               Includes AgrVATE, SpaTyper and Staphpopia SCCmec for Staphylococcus aureus

Modules
  agrvate                  Rapid identification of Staphylococcus aureus agr locus type and agr operon variants
  ectyper                  In-silico prediction of Escherichia coli serotype
  fastani                  fast alignment-free computation of whole-genome Average Nucleotide Identity (ANI)
  hicap                    Identify cap locus serotype and structure in your Haemophilus influenzae assemblies
  kleborate                Screen for MLST, sub-species, and other Klebsiella related genes of interest
  mashtree                 Create a trees using Mash distance
  pirate                   Pangenome toolbox for bacterial genomes
  prokka                   Whole genome annotation of small genomes (bacterial, archeal, viral)
  roary                    Rapid large-scale prokaryote pangenome analysis
  spatyper                 Computational method for finding spa types in Staphylococcus aureus
  staphopiasccmec          Primer based SCCmec typing of Staphylococcus aureus genomes

--------------------------------------------------------------------
If you use bactopia for your analysis please cite:

* Bactopia
  https://doi.org/10.1128/mSystems.00190-20

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/bactopia/bactopia/blob/master/CITATIONS.md
--------------------------------------------------------------------

I'll have to document this process, but I think for power users it should be pretty straightforward.

I've set in place a path to have the documentation autogenerated from params.json files, which will be super useful for keeping things in sync on the docs side.
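
As a sketch of what generating docs from a params.json could involve, the Python below walks a schema and prints one line per parameter. The layout is an assumption modeled on nf-core-style JSON schemas (groups under "definitions", each with "properties"); the example schema, its parameter names, and the function render_docs are all made up for illustration and are not taken from Bactopia.

```python
import json

# Made-up schema in an nf-core-like layout; not an actual Bactopia params.json.
EXAMPLE_SCHEMA = json.loads("""
{
  "definitions": {
    "example_parameters": {
      "title": "Example Parameters",
      "properties": {
        "example_flag": {
          "type": "boolean",
          "default": false,
          "description": "A made-up on/off option"
        },
        "example_ext": {
          "type": "string",
          "default": "fna",
          "description": "A made-up file extension option"
        }
      }
    }
  }
}
""")


def render_docs(schema):
    """Render each parameter group as a title followed by one line per param."""
    lines = []
    for group in schema.get("definitions", {}).values():
        lines.append(group.get("title", "Parameters"))
        for name, spec in group.get("properties", {}).items():
            kind = spec.get("type", "string")
            default = spec.get("default", "")
            description = spec.get("description", "")
            lines.append(f"  --{name} ({kind}, default: {default})  {description}")
    return "\n".join(lines)


print(render_docs(EXAMPLE_SCHEMA))
```

Because the same schema file can also drive the help message, the docs and the command line stay in sync by construction.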

@abhi18av
Member

abhi18av commented Nov 3, 2021

This is awesome - I am learning a lot from the DSL1 -> DSL2 (+ nf-core modules) migration in Bactopia for some other workflows I've written. Thanks for sharing the updates @rpetit3! 😊

@rpetit3
Member Author

rpetit3 commented Nov 29, 2021

Some of these aren't quite ready for v2, or need v2 before they can be accomplished. For now I'm going to put them here and start a project board to better capture them.

Implement pytest for testing

I'd like to create a suite of tests that are run by pytest and pytest-workflow. The nf-core/modules team has a framework that can be extended to Bactopia.

  • Create Tests for Bactopia Tool Modules
    • gtdb
    • phyloflash

Additional Ways to Run Bactopia

Bactopia, thanks to Nextflow, can be run in many different environments. But I would like to make it easier to run on certain platforms, for example Terra and CGC, and also for advanced cloud users. For these, it might be best to run them as "Nextflow within Nextflow" in order to reduce the number of times instances are started and stopped. This is because we don't want it taking longer to set up an instance than it does to actually run the task.

  • Platforms
    • Terra.bio
    • Cancer Genomics Cloud
    • Dockstore
    • AWS Cloud Formation
    • NF-Tower

Additional Bactopia Tools

There are some Bactopia Tools I would like to get added into v2. Good thing is DSL2 makes it easier to add new tools in the future.

@rpetit3
Member Author

rpetit3 commented Dec 5, 2021

Happy to report v2 has been released: https://github.com/bactopia/bactopia/releases/tag/v2.0.0

Thank you very much @Mxrcon and @abhi18av for your help, and everyone for your feedback! Super excited to see where Bactopia goes from here!

@rpetit3 rpetit3 closed this as completed Dec 5, 2021
@rpetit3 rpetit3 unpinned this issue Feb 22, 2022

5 participants