# **Bioinformatics Pipeline Walkthrough**

This document provides a step-by-step walkthrough of the adaptive sampling analysis pipeline. Each section includes an explanation and a corresponding bash code block that you can copy and run in your terminal. Most steps are followed by a "Peek" section with commands for inspecting the generated files.

**Prerequisites:**

* Ensure all necessary bioinformatics tools are installed (Minimap2, Samtools, Seqtk, Awk, Pigz, etc.).  
* Your input data should follow the directory layout the script expects (see BUCKET\_DIR in Section 0).  
* You should be in the directory where you want the main output folder (Output\_AS) to be created (e.g., ${HOME}).

**Important:** Run these blocks sequentially in the **same terminal session**, as variables defined in earlier blocks are used in later ones.

## **Section 0: Configuration & Initial Setup**

This first block sets up crucial variables for the entire pipeline.

* **SAMPLEID**: **You MUST modify this variable** to match the specific sample you are processing.  
* **THREADS**: This is dynamically set to the number of available processing units on your system using nproc. You can optionally cap this.  
* **BUCKET\_DIR**: This points to the location of your input data. Adjust if your data is elsewhere. The script assumes a structure like ${BUCKET\_DIR}/${SAMPLEID}/basecalled/....  
* **REFERENCE\_BASENAME**: The filename of your reference FASTA.  
* **MAIN\_OUTPUT\_ROOT**: The root directory where all sample-specific output folders will be created.  
* Other variables define names for intermediate files, lists for looping, and organism-specific reference names.

**Copy and paste the entire block below into your terminal and press Enter.**
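
The sketch below illustrates the shape of such a configuration block. Only the variable names `SAMPLEID`, `THREADS`, `BUCKET_DIR`, `REFERENCE_BASENAME`, and `MAIN_OUTPUT_ROOT` come from the list above; every value, the `TREATMENTS` and `MAX_THREADS` names, and the cap of 16 threads are placeholder assumptions to replace with your own settings.

```bash
# All values below are illustrative placeholders, not the pipeline's real settings.
SAMPLEID="sample01"                    # MUST be changed to match your sample
BUCKET_DIR="${HOME}/bucket"            # assumed location of the input data
REFERENCE_BASENAME="reference.fasta"   # assumed reference FASTA filename
MAIN_OUTPUT_ROOT="${PWD}/Output_AS"    # output root under the current directory

# Use every available processing unit; fall back to 1 if nproc is unavailable.
THREADS=$(nproc 2>/dev/null || echo 1)

# Optional cap (16 is an arbitrary example).
MAX_THREADS=16
if [ "${THREADS}" -gt "${MAX_THREADS}" ]; then
  THREADS="${MAX_THREADS}"
fi

# Treatments looped over in later sections (hypothetical variable name).
TREATMENTS="AS Control"

echo "Sample: ${SAMPLEID} | threads: ${THREADS} | output root: ${MAIN_OUTPUT_ROOT}"
```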

### **Peek at the Initial Setup (Optional)**

After running the block above, you can check whether the input directories are accessible and whether the output root is ready to be created.
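
One way to run such a check is sketched here. The `peek_path` helper is hypothetical, and the `:-` fallback values are placeholders that only keep the sketch runnable on its own; in a real session the variables come from the configuration block.

```bash
# Hypothetical helper: report whether a path exists without aborting the session.
peek_path() {
  if [ -e "$1" ]; then
    echo "OK:      $1"
  else
    echo "MISSING: $1"
  fi
}

# The :- defaults are placeholders used only when the variable is not set.
peek_path "${BUCKET_DIR:-${HOME}/bucket}/${SAMPLEID:-sample01}"
peek_path "${MAIN_OUTPUT_ROOT:-${PWD}/Output_AS}"
```

A `MISSING` result for the output root is expected at this point, since the output directories are only created in Section 2.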

## **Section 1: Find Sequencing Summary & Determine Column Numbers**

This section dynamically locates your sequencing\_summary\*.txt file and then reads its header to determine the column numbers for "channel", "passes\_filtering", "sequence\_length\_template", and "read\_id". This makes the script robust to changes in column order.
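
The header lookup can be sketched as follows. The variable names match the ones echoed in the peek section, but the `get_col_num` helper, the `find` invocation shown in the comments, and the toy header are assumptions made for illustration.

```bash
# Hypothetical helper: print the 1-based position of a column name in a
# tab-separated header line (prints nothing if the name is absent).
get_col_num() {
  printf '%s\n' "$1" | awk -F'\t' -v name="$2" '
    { for (i = 1; i <= NF; i++) if ($i == name) { print i; exit } }'
}

# In the pipeline the header would come from the located summary file, e.g.:
#   BASECALLED_SEQ_SUMMARY=$(find "${BUCKET_DIR}/${SAMPLEID}" -name 'sequencing_summary*.txt' | head -n 1)
#   HEADER_LINE=$(head -n 1 "${BASECALLED_SEQ_SUMMARY}")
# A toy header is used here so the sketch runs on its own.
HEADER_LINE=$(printf 'filename\tread_id\tchannel\tpasses_filtering\tsequence_length_template')

COL_CHANNEL_NUM=$(get_col_num "${HEADER_LINE}" "channel")
COL_PASSES_FILTERING_NUM=$(get_col_num "${HEADER_LINE}" "passes_filtering")
COL_SEQUENCE_LENGTH_TEMPLATE_NUM=$(get_col_num "${HEADER_LINE}" "sequence_length_template")
COL_READ_ID_NUM=$(get_col_num "${HEADER_LINE}" "read_id")

echo "channel=${COL_CHANNEL_NUM} passes_filtering=${COL_PASSES_FILTERING_NUM}"
echo "sequence_length_template=${COL_SEQUENCE_LENGTH_TEMPLATE_NUM} read_id=${COL_READ_ID_NUM}"
```

Looking columns up by name rather than by fixed position means that later `awk` filters keep working if the basecaller reorders or adds columns.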

### **Peek at Column Determination (Optional)**

```bash
echo "--- Peeking at Column Determination ---"
echo "Found sequencing summary file: ${BASECALLED_SEQ_SUMMARY}"
echo "Header line used for column mapping: "
echo "${HEADER_LINE}"
echo "Determined column numbers:"
echo "  Channel: ${COL_CHANNEL_NUM}"
echo "  Passes Filtering: ${COL_PASSES_FILTERING_NUM}"
echo "  Sequence Length: ${COL_SEQUENCE_LENGTH_TEMPLATE_NUM}"
echo "  Read ID: ${COL_READ_ID_NUM}"
echo "--- End Peeking ---"
echo ""
```

## **Section 2: Create Output Directories & Verify Inputs**

This block creates the necessary directory structure for all output files and then verifies that essential input files (reference, sequencing summary, FASTQ directory) exist.
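
A minimal sketch of this create-then-verify pattern is shown here, run against a scratch directory. The `require_input` helper, the `split` and `mapped` subdirectory names, and the file names are all assumptions; the real pipeline's layout may differ.

```bash
# Hypothetical helper: fail with a message if a required input is absent.
require_input() {
  if [ ! -e "$1" ]; then
    echo "ERROR: required input not found: $1" >&2
    return 1
  fi
  echo "Found: $1"
}

# Demo in a scratch directory; the subdirectory names are assumptions.
DEMO_ROOT=$(mktemp -d)
OUT_DIR="${DEMO_ROOT}/Output_AS/sample01"
mkdir -p "${OUT_DIR}/split" "${OUT_DIR}/mapped"   # -p creates parents and tolerates re-runs

touch "${DEMO_ROOT}/reference.fasta"              # stand-in for the real reference
require_input "${DEMO_ROOT}/reference.fasta"
require_input "${DEMO_ROOT}/no_such_summary.txt" || echo "(a real run would stop here)"
```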

### **Peek at Created Directories (Optional)**

## **Section 3: Part 1 \- Splitting Output by Treatment**

This part of the pipeline focuses on initial read processing:

1. **Concatenate Pass Reads**: All fastq.gz files from the basecalled/fastq\_pass directory are combined into a single file.  
2. **Process Each Treatment**: A loop iterates through "AS" and "Control" treatments. For each:  
   * A filtered sequencing summary is created based on channel, passes\_filtering status, and sequence length.  
   * A list of read IDs matching these criteria is generated.  
   * A filtered FASTQ file for the treatment is created using seqtk.
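
The concatenation step can be sketched with toy data as follows. The scratch directory and the output name `all_pass.fastq.gz` are hypothetical; in the pipeline the inputs live under basecalled/fastq\_pass.

```bash
# Toy stand-in for basecalled/fastq_pass with two single-read files.
DEMO=$(mktemp -d)
mkdir -p "${DEMO}/fastq_pass"
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "${DEMO}/fastq_pass/a.fastq.gz"
printf '@r2\nTTGA\n+\nIIII\n' | gzip -c > "${DEMO}/fastq_pass/b.fastq.gz"

# Concatenated gzip members form a valid gzip stream, so the files can be
# joined with cat without decompressing them first.
cat "${DEMO}"/fastq_pass/*.fastq.gz > "${DEMO}/all_pass.fastq.gz"

ls -l "${DEMO}/all_pass.fastq.gz"
```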

### **Peek at Concatenated Reads**
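
A quick way to inspect a gzipped FASTQ file is sketched here. The `peek_fastq_gz` helper is hypothetical and is demonstrated on a toy file; point it at your combined file instead.

```bash
# Hypothetical helper: show the first FASTQ record and the total read count.
# A FASTQ record spans four lines, so the read count is the line count / 4.
peek_fastq_gz() {
  gzip -dc "$1" | awk 'NR <= 4 { print } END { print "Total reads: " NR / 4 }'
}

# Toy file so the sketch runs on its own.
PEEK_DEMO=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "${PEEK_DEMO}/toy.fastq.gz"
peek_fastq_gz "${PEEK_DEMO}/toy.fastq.gz"
```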

Now, continue with the loop for processing each treatment.
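
The per-treatment filtering can be sketched on a toy sequencing summary as shown here. The channel ranges (1-256 for AS, 257-512 for Control), the 1000 bp length cutoff, the `TRUE`/`FALSE` flag values, and all file names are assumptions made for illustration; substitute the values your run actually used.

```bash
# --- Assumptions: length cutoff, channel ranges, TRUE/FALSE flag values ---
MIN_LEN=1000
# Column positions for the toy file; in the pipeline these come from Section 1.
COL_READ_ID_NUM=1
COL_CHANNEL_NUM=2
COL_PASSES_FILTERING_NUM=3
COL_SEQUENCE_LENGTH_TEMPLATE_NUM=4

FILT_DEMO=$(mktemp -d)
{
  printf 'read_id\tchannel\tpasses_filtering\tsequence_length_template\n'
  printf 'r1\t10\tTRUE\t5000\n'
  printf 'r2\t300\tTRUE\t4000\n'
  printf 'r3\t20\tFALSE\t6000\n'
  printf 'r4\t30\tTRUE\t200\n'
} > "${FILT_DEMO}/seqsum.txt"

for TRMT in AS Control; do
  case "${TRMT}" in
    AS)      CH_LO=1;   CH_HI=256 ;;   # assumed adaptive sampling channels
    Control) CH_LO=257; CH_HI=512 ;;   # assumed control channels
  esac
  # Keep the header plus rows in the channel range that pass filtering
  # and meet the length cutoff.
  awk -F'\t' -v c="${COL_CHANNEL_NUM}" -v p="${COL_PASSES_FILTERING_NUM}" \
      -v l="${COL_SEQUENCE_LENGTH_TEMPLATE_NUM}" \
      -v lo="${CH_LO}" -v hi="${CH_HI}" -v min="${MIN_LEN}" \
      'NR == 1 || ($c >= lo && $c <= hi && $p == "TRUE" && $l >= min)' \
      "${FILT_DEMO}/seqsum.txt" > "${FILT_DEMO}/${TRMT}.seqsum.txt"
  # Read IDs only, without the header line.
  awk -F'\t' -v r="${COL_READ_ID_NUM}" 'NR > 1 { print $r }' \
      "${FILT_DEMO}/${TRMT}.seqsum.txt" > "${FILT_DEMO}/${TRMT}.read_ids.txt"
  # With seqtk and pigz installed, the FASTQ subset would follow, e.g.:
  #   seqtk subseq all_pass.fastq.gz "${TRMT}.read_ids.txt" | pigz -p "${THREADS}" > "${TRMT}.fastq.gz"
  echo "${TRMT}: $(wc -l < "${FILT_DEMO}/${TRMT}.read_ids.txt") read(s) kept"
done
```

In this toy input only `r1` survives for AS and only `r2` for Control: `r3` fails filtering and `r4` is shorter than the cutoff.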

## **Section 4: Part 1b \- Combine Treatment-Specific Sequencing Summaries**

After creating individual sequencing summaries for each treatment, this step combines them into a single file, adding a new "TRMT" column to indicate the origin of each row.
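
A minimal sketch of this merge, using two toy per-treatment summaries, is shown here. The file names and the `FIRST` flag are hypothetical; the idea is to write the header once with "TRMT" appended and then tag every data row with its treatment.

```bash
COMB_DEMO=$(mktemp -d)
printf 'read_id\tchannel\nr1\t10\n'           > "${COMB_DEMO}/AS.seqsum.txt"
printf 'read_id\tchannel\nr2\t300\nr5\t310\n' > "${COMB_DEMO}/Control.seqsum.txt"

COMBINED="${COMB_DEMO}/combined.seqsum.txt"
: > "${COMBINED}"          # start from an empty file
FIRST=1                    # 1 until the header has been written
for TRMT in AS Control; do
  awk -F'\t' -v OFS='\t' -v trmt="${TRMT}" -v first="${FIRST}" '
    NR == 1 { if (first == 1) print $0, "TRMT"; next }   # header once, with the new column
    { print $0, trmt }                                   # tag each data row with its treatment
  ' "${COMB_DEMO}/${TRMT}.seqsum.txt" >> "${COMBINED}"
  FIRST=0
done

cat "${COMBINED}"
```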

### **Peek at Combined Treatment Summary**

## **Section 5: Part 2 \- Analyzing the Output (Mapping & Isolate Processing)**

This is the main analysis part. It loops through each treatment ("AS", "Control"). For each treatment:

1. **Map to Entire Community**: Reads are mapped to the full reference genome using minimap2. The resulting BAM file is sorted and indexed.  
2. **Extract All Mapped Reads**: A list of all read IDs that mapped to any part of the reference is created. A FASTQ file containing these mapped reads is generated, along with a corresponding sequencing summary.  
3. **Process Individual Isolates**: An inner loop iterates through each defined isolate (Bacillus, E. coli, etc.).  
   * The community BAM file is filtered to get reads mapping specifically to the current isolate's reference sequence(s).  
   * This isolate-specific BAM is indexed.  
   * A list of read IDs mapped to this isolate is generated.  
   * A FASTQ file and sequencing summary for these isolate-specific reads are created.

**Note**: This is a large block as it contains nested loops. It's designed to be run as one unit.
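
The steps above can be sketched as three shell functions. The function names, argument order, and output naming are hypothetical, and the `map-ont` preset assumes Oxford Nanopore reads. The first two functions need minimap2, samtools, seqtk, and pigz on your `PATH`; nothing is executed until you call them.

```bash
# Hypothetical function: map one treatment's reads to the community reference
# and extract the reads that mapped anywhere.
map_treatment() {   # usage: map_treatment ref.fasta reads.fastq.gz out_prefix threads
  local ref="$1"
  local reads="$2"
  local prefix="$3"
  local threads="$4"
  # Align, then coordinate-sort and index the alignments.
  minimap2 -ax map-ont -t "${threads}" "${ref}" "${reads}" \
    | samtools sort -@ "${threads}" -o "${prefix}.bam" -
  samtools index "${prefix}.bam"
  # -F 4 drops unmapped records; sort -u collapses reads with several alignments.
  samtools view -F 4 "${prefix}.bam" | cut -f 1 | sort -u > "${prefix}.read_ids.txt"
  seqtk subseq "${reads}" "${prefix}.read_ids.txt" | pigz -p "${threads}" > "${prefix}.fastq.gz"
}

# Hypothetical function: restrict the indexed community BAM to one isolate.
# Arguments after the prefix are that isolate's reference sequence names.
extract_isolate() {   # usage: extract_isolate community.bam out_prefix contig [contig...]
  local bam="$1"
  local prefix="$2"
  shift 2
  samtools view -b -o "${prefix}.bam" "${bam}" "$@"
  samtools index "${prefix}.bam"
  samtools view "${prefix}.bam" | cut -f 1 | sort -u > "${prefix}.read_ids.txt"
}

# Hypothetical helper: keep the header and the rows whose read ID is listed.
subset_seqsum() {   # usage: subset_seqsum ids.txt seqsum.txt read_id_column
  awk -F'\t' -v ids="$1" -v col="$3" '
    BEGIN { while ((getline id < ids) > 0) keep[id] }   # load the wanted read IDs
    NR == 1 || ($col in keep)                           # header plus matching rows
  ' "$2"
}
```

In the nested loops, `map_treatment` would run once per treatment, followed by `extract_isolate` and `subset_seqsum` once per isolate, with the read ID column number taken from Section 1.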

## **Section 6: Part 3 \- Combine Mapped Isolate-Specific Sequencing Summaries**

Finally, this section takes all the individual sequencing summaries created for each isolate within each treatment (`.mapped.${TRMT}.${ISO}.seqsum.txt`) and combines them into one large summary file. It adds two new columns, "TRMT" and "ISO", to indicate the source of each row.
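
This final merge can be sketched with toy inputs as follows. The sample ID, the isolate labels `Bacillus` and `Ecoli`, and the combined output name are placeholders; only the `.mapped.${TRMT}.${ISO}.seqsum.txt` naming pattern comes from the description above.

```bash
ISO_DEMO=$(mktemp -d)
SAMPLE="sample01"                  # placeholder sample ID
for TRMT in AS Control; do
  for ISO in Bacillus Ecoli; do    # placeholder isolate labels
    printf 'read_id\tsequence_length_template\n%s_%s_read\t1500\n' "${TRMT}" "${ISO}" \
      > "${ISO_DEMO}/${SAMPLE}.mapped.${TRMT}.${ISO}.seqsum.txt"
  done
done

FINAL="${ISO_DEMO}/${SAMPLE}.mapped.combined.seqsum.txt"
: > "${FINAL}"
HEADER_DONE=0
for TRMT in AS Control; do
  for ISO in Bacillus Ecoli; do
    IN="${ISO_DEMO}/${SAMPLE}.mapped.${TRMT}.${ISO}.seqsum.txt"
    [ -s "${IN}" ] || continue     # skip isolates without a summary
    if [ "${HEADER_DONE}" -eq 0 ]; then
      # Write the header once, extended with the two new column names.
      printf '%s\tTRMT\tISO\n' "$(head -n 1 "${IN}")" >> "${FINAL}"
      HEADER_DONE=1
    fi
    # Append the data rows, tagged with their treatment and isolate.
    awk -F'\t' -v OFS='\t' -v trmt="${TRMT}" -v iso="${ISO}" \
        'NR > 1 { print $0, trmt, iso }' "${IN}" >> "${FINAL}"
  done
done

cat "${FINAL}"
```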

### **Peek at Final Combined Mapped Isolate Summary**
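
One possible check is to tally the rows per treatment and isolate. The `tally_trmt_iso` helper is hypothetical and assumes that "TRMT" and "ISO" are the last two columns of the combined file; it is demonstrated on a toy file.

```bash
# Hypothetical helper: count data rows per TRMT/ISO pair, assuming those are
# the last two tab-separated columns.
tally_trmt_iso() {
  awk -F'\t' '
    NR > 1 { n[$(NF - 1) "\t" $NF]++ }
    END    { for (k in n) print k "\t" n[k] }
  ' "$1" | sort
}

# Toy combined summary so the sketch runs on its own.
PEEK_FILE=$(mktemp)
printf 'read_id\tTRMT\tISO\nr1\tAS\tBacillus\nr2\tAS\tBacillus\nr3\tControl\tEcoli\n' > "${PEEK_FILE}"

head -n 3 "${PEEK_FILE}"       # first few rows, including the header
tally_trmt_iso "${PEEK_FILE}"  # rows per treatment/isolate pair
```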