# **EPI2ME *wf-bacterial-genomes* Workflow: A Training Explanation**

## **Introduction to Bacterial Genome Analysis**

In this section, we explore *wf-bacterial-genomes*, a Nextflow workflow from EPI2ME Labs for analyzing bacterial genomes from Oxford Nanopore sequencing data. It is particularly useful for:

* ***De novo*** **Genome Assembly:** Constructing a complete or near-complete genome sequence without relying on a reference genome.  
* **Genome Annotation:** Identifying and labeling genes and other important features within the assembled genome.  
* **Characterizing Bacterial Isolates:** (Optional, using the `--isolates` mode) Providing additional information, such as:  
  * Multi-locus sequence typing (MLST)  
  * Antimicrobial resistance (AMR) gene detection  
  * *Salmonella* serotyping

## **Workflow Overview**

The *wf-bacterial-genomes* workflow automates several key steps in bacterial genome analysis. Here's a breakdown of the main stages:

1. **Input Processing:**  
   * The workflow accepts raw sequencing data in FASTQ or BAM format.  
   * It can handle single files, directories of files, or directories containing subdirectories of files.  
   * The fastcat tool concatenates multiple input files (if provided) and, together with bamstats for BAM input, generates per-read statistics such as read length and mean quality score. This step puts the data into a consistent format for downstream analysis and provides initial quality control metrics.  
2. **Genome Assembly:**  
   * The workflow performs *de novo* assembly of the bacterial genome. *De novo* assembly means the genome is assembled from scratch, without using a reference genome as a template. This is crucial for novel or poorly characterized organisms.  
3. **Genome Annotation:**  
   * The assembled genome is annotated using Prokka, a rapid prokaryotic genome annotation tool.  
   * Prokka identifies and labels various genomic features, such as protein-coding genes, ribosomal RNA genes, transfer RNA genes, and other non-coding regions.  
   * By default, Prokka uses its built-in databases, but users can customize the annotation process using the `--prokka_opts` parameter.  
4. **Isolate Characterization (Optional):**  
   * If the `--isolates` mode is enabled, the workflow performs additional analyses to characterize the bacterial isolate:  
     * **Multi-Locus Sequence Typing (MLST):** MLST is a technique used to classify bacterial isolates based on variations in a set of essential housekeeping genes. The workflow uses PubMLST databases to identify the specific sequence type of the isolate.  
     * **Antimicrobial Resistance (AMR) Calling:** The workflow uses ResFinder to detect genes and single nucleotide polymorphisms (SNPs) associated with resistance to various antimicrobial drugs. This analysis helps understand the resistance profile of the bacteria.  
     * ***Salmonella*** **Serotyping:** For *Salmonella* isolates, the workflow performs serotyping and antigenic profile prediction using SeqSero2. This provides detailed information about the serotype and its antigenic characteristics.
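
To make the per-read statistics from the input-processing step concrete, the sketch below computes a length and mean quality for each FASTQ record with `awk`. It is an illustration only, not how fastcat or bamstats are implemented. The function name `fastq_read_stats` is hypothetical, the code assumes four-line FASTQ records with Phred+33 qualities, and it takes a plain arithmetic mean of the Phred scores, which may differ from how the real tools summarise quality.

```shell
#!/usr/bin/env bash
# Illustrative only: per-read length and mean Phred score from a FASTQ stream.
# Assumes 4-line records and Phred+33 encoding; fastq_read_stats is a
# hypothetical helper name, not part of fastcat or the workflow.
fastq_read_stats() {
    awk '
        BEGIN {
            OFS = "\t"
            # Build a character -> Phred score lookup (awk has no ord()).
            for (i = 33; i <= 126; i++) phred[sprintf("%c", i)] = i - 33
            print "read_id", "length", "mean_phred"
        }
        NR % 4 == 1 { id = substr($1, 2) }    # header line: drop the leading "@"
        NR % 4 == 0 {                         # quality line
            n = length($0); sum = 0
            for (i = 1; i <= n; i++) sum += phred[substr($0, i, 1)]
            printf "%s\t%d\t%.1f\n", id, n, (n ? sum / n : 0)
        }
    ' "$@"
}

# Demo on two made-up reads: "I" encodes Q40 and "5" encodes Q20.
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\n5555\n' | fastq_read_stats
```

The demo prints a header row followed by `read1 8 40.0` and `read2 4 20.0` (tab-separated). The function also accepts file names as arguments, for example `fastq_read_stats reads.fastq`, where `reads.fastq` is a placeholder for an uncompressed FASTQ file.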

## **Key Concepts and Tools**

* **Oxford Nanopore Sequencing:** This technology produces long reads of DNA, which are particularly useful for *de novo* genome assembly, as they can span repetitive regions and resolve complex genomic structures.  
* **FASTQ/BAM:** These are standard file formats for storing sequencing data. FASTQ files are plain text and contain the raw sequence reads with their per-base quality scores; BAM files store reads in a compressed binary format, either aligned to a reference or unaligned.  
* ***De novo*** **Assembly:** The process of assembling a genome from scratch, without using a reference genome.  
* **Genome Annotation:** The process of identifying and labeling the functional elements within a genome, such as genes, regulatory regions, and other features.  
* **Prokka:** A software tool used to rapidly annotate bacterial, archaeal, and viral genomes.  
* **Nextflow:** A workflow management system used by EPI2ME Labs to define and execute complex bioinformatics pipelines. Nextflow enables parallel processing, handles dependencies, and ensures reproducibility.  
* **MLST (Multi-Locus Sequence Typing):** A technique to characterize bacterial strains using the sequences of several essential housekeeping genes.  
* **AMR (Antimicrobial Resistance):** The ability of microorganisms to withstand the effects of antimicrobial drugs.  
* **ResFinder:** A database and tool for identifying acquired antimicrobial resistance genes and, for supported species, resistance-associated point mutations in bacterial DNA sequences.  
* **SeqSero2:** A tool for *Salmonella* serotype and antigenic profile prediction.
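
The MLST concept above amounts to an exact-match lookup: each housekeeping gene is assigned an allele number, and the combination of numbers (the allelic profile) maps to a sequence type. The sketch below shows only that lookup step. The loci, allele numbers, and sequence types are made up for illustration, `lookup_sequence_type` is a hypothetical helper, and real schemes and profiles come from PubMLST.

```shell
#!/usr/bin/env bash
# Illustrative only: MLST as an exact-match lookup of an allelic profile.
# The loci, allele numbers, and sequence types below are made up.
lookup_sequence_type() {
    # $1: TSV profile table (header row, then ST followed by allele numbers)
    # $2: comma-separated allele numbers for the isolate, in table column order
    awk -F '\t' -v query="$2" '
        NR == 1 { next }                          # skip the header row
        {
            profile = $2 ""                       # force string comparison
            for (i = 3; i <= NF; i++) profile = profile "," $i
            if (profile == query) { print "ST" $1; found = 1; exit }
        }
        END { if (!found) print "novel or incomplete profile" }
    ' "$1"
}

# Hypothetical three-locus scheme written to a temporary file.
profiles="$(mktemp)"
printf 'ST\tgeneA\tgeneB\tgeneC\n1\t1\t1\t1\n2\t1\t2\t1\n3\t4\t2\t7\n' > "$profiles"

lookup_sequence_type "$profiles" "1,2,1"   # prints: ST2
lookup_sequence_type "$profiles" "9,9,9"   # prints: novel or incomplete profile
rm -f "$profiles"
```

In practice the allele numbers themselves are obtained by comparing the assembled gene sequences against the scheme's allele database, a step this sketch deliberately leaves out.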

## **Workflow Benefits**

* **Comprehensive Analysis:** The workflow provides a comprehensive analysis of bacterial genomes, from assembly and annotation to isolate characterization.  
* **Automation:** The workflow automates many of the complex steps involved in genome analysis, making it easier to use and less prone to errors.  
* **Reproducibility:** Because the workflow is defined in Nextflow, a run can be repeated with the same workflow revision, parameters, and inputs, which makes results much easier to reproduce than with ad hoc scripts.  
* **Scalability:** The workflow can be run on various computing platforms, from individual workstations to high-performance computing clusters, making it suitable for analyzing both small and large datasets.
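
One practical way to support reproducibility is to pin the workflow revision with Nextflow's `-r` option rather than running whatever is currently on the default branch. The sketch below only prints the command (a dry run) so it can be inspected first. `build_pinned_command` is a hypothetical helper, and the tag `v1.4.2` and the input and output paths are placeholders that assume such a release tag exists in the repository.

```shell
#!/usr/bin/env bash
# Illustrative only: print (do not run) a Nextflow command pinned to a revision.
# build_pinned_command is a hypothetical helper; the arguments are placeholders.
build_pinned_command() {
    # $1: revision (tag, branch, or commit), $2: FASTQ input, $3: output directory
    printf 'nextflow run epi2me-labs/wf-bacterial-genomes -r %s --fastq %s --out_dir %s -profile standard\n' \
        "$1" "$2" "$3"
}

build_pinned_command "v1.4.2" "path/to/fastq" "output"
```

Recording the printed command alongside the results documents exactly which revision and parameters produced them.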

## **Running the Workflow**

The *wf-bacterial-genomes* workflow is typically executed using the Nextflow command-line tool. EPI2ME provides detailed instructions and examples on how to install and run the workflow, including specifying input data, setting parameters, and managing the analysis. We will cover the specific commands and parameters in the practical session.

The workflow's built-in help lists the available parameters. From a notebook cell:

````
%%bash
cd ~
~/nextflow run epi2me-labs/wf-bacterial-genomes --help
````

The output begins as follows (truncated here):

````
 N E X T F L O W   ~  version 25.04.2

Pulling epi2me-labs/wf-bacterial-genomes ...
 downloaded from https://github.com/epi2me-labs/wf-bacterial-genomes.git
Launching `https://github.com/epi2me-labs/wf-bacterial-genomes` [compassionate_jennings] DSL2 - revision: c8efc3176a [master]

WARN: NEXTFLOW RECURSION IS A PREVIEW FEATURE - SYNTAX AND FUNCTIONALITY CAN CHANGE IN FUTURE RELEASES

||||||||||   _____ ____ ___ ____  __  __ _____      _       _
||||||||||  | ____|  _ \_ _|___ \|  \/  | ____|    | | __ _| |__  ___
|||||       |  _| | |_) | |  __) | |\/| |  _| _____| |/ _` | '_ \/ __|
|||||       | |___|  __/| | / __/| |  | | |__|_____| | (_| | |_) \__ \
||||||||||  |_____|_|  |___|_____|_|  |_|_____|    |_|\__,_|_.__/|___/
||||||||||  wf-bacterial-genomes v1.4.2-gc8efc31
--------------------------------------------------------------------------------
Typical pipeline command:

  nextflow run epi2me-labs/wf-bacterial-genomes \
	--fastq 'wf-bacterial-genomes-demo/isolates_fastq' \
	--
````

**Leveraging the JupyterLab Terminal:**

For users working within a JupyterLab environment (such as on Vertex AI), the terminal provides a convenient way to execute Nextflow commands. Here's how you can run the workflow:

1. **Open a Terminal:** In JupyterLab, navigate to "File" \> "New" \> "Terminal". This will open a new terminal window within your JupyterLab interface.  
2. **Execute the Command:** You can directly copy and paste the Nextflow command into the terminal. For example, to run the workflow with the provided parameters, use the following:  

````
~/nextflow run epi2me-labs/wf-bacterial-genomes \
    --fastq "${HOME}/dsc-epi2me-data/wf-bacterial-genomes-demo/isolates_fastq" \
    --isolates \
    --sample_sheet "${HOME}/dsc-epi2me-data/wf-bacterial-genomes-demo/isolates_sample_sheet.csv" \
    --out_dir 'bacterial-genomes-demo_output' \
    -profile standard
```` 

3. **Monitor Execution:** The workflow will begin to execute, and you will see the progress, any error messages, and the final results directly in the terminal.
4. **View the Results:**
    * Once the workflow has completed, the output files will be located in the `bacterial-genomes-demo_output` directory.
    * Locate the HTML report within this directory (named `wf-bacterial-genomes-report.html` in recent releases; if it is not there, look for other `.html` files). This file contains a comprehensive summary of the workflow results.
    * Open the HTML file.  JupyterLab may prompt you to "Trust HTML" at the top left of the file.  If so, click "Trust HTML" to ensure that the report renders correctly and all elements are displayed.
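
The steps above can be bracketed with two small checks: one before launching, to confirm that the inputs exist, and one afterwards, to find the HTML reports. This is a minimal sketch. `check_inputs` and `list_reports` are hypothetical helpers, not part of the workflow, and the paths are the demo locations used in the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers around a run; neither is part of the workflow.

# Check that the paths passed to --fastq and --sample_sheet exist.
check_inputs() {
    fastq_dir="$1"; sample_sheet="$2"; status=0
    [ -d "$fastq_dir" ]    || { echo "Missing FASTQ directory: $fastq_dir" >&2; status=1; }
    [ -f "$sample_sheet" ] || { echo "Missing sample sheet: $sample_sheet" >&2; status=1; }
    return "$status"
}

# List HTML reports at the top level of the output directory.
list_reports() {
    out_dir="${1:-bacterial-genomes-demo_output}"
    [ -d "$out_dir" ] || { echo "Output directory not found: $out_dir" >&2; return 1; }
    find "$out_dir" -maxdepth 1 -type f -name '*.html' | sort
}

demo="${HOME}/dsc-epi2me-data/wf-bacterial-genomes-demo"
check_inputs "${demo}/isolates_fastq" "${demo}/isolates_sample_sheet.csv" \
    || echo "Fix the input paths before launching the workflow."
list_reports "bacterial-genomes-demo_output" \
    || echo "No output yet: run the workflow first."
```

Listing every `.html` file rather than a single fixed name is a deliberate choice here, since report filenames may differ between workflow versions and modes.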

## **Learning Outcomes**

By the end of this section, you should be able to:

* Understand the purpose and applications of the *wf-bacterial-genomes* workflow.  
* Describe the key steps involved in bacterial genome assembly, annotation, and isolate characterization.  
* Identify the main tools and concepts used in the workflow.  
* Appreciate the benefits of using a workflow for bacterial genome analysis.
