# **Walkthrough: get\_all\_stats.sh \- FASTQ Summary Statistics Script**

This script uses SeqKit stats to summarize FASTQ files from your main analysis pipeline. It processes two types of files:

1. **Treatment-Specific FASTQs**: Initial filtered files for each treatment (e.g., "AS", "Control").  
2. **Mapped Isolate-Specific FASTQs**: Files with reads mapped to individual isolates within each treatment.

The script writes two summary .tsv files, one for each file type.

**Prerequisites:**

* Main analysis pipeline successfully run.  
* SeqKit installed and in your PATH.

## **Script Breakdown**

### **Section 0: Script Header, Behavior, and Configuration**

* **Shebang (\#\!/bin/bash)**: Tells the system to use bash to run the script.  
* **set \-e \-u \-o pipefail**: Makes the script safer: it exits as soon as a command fails (\-e), treats unset variables as errors (\-u), and makes a pipeline fail if any command in it fails, not just the last one (\-o pipefail).  
* **SAMPLEID from Command Line**: The script requires a SAMPLEID to be given when you run it (e.g., ./get\_all\_stats.sh my\_sample).  
* **THREADS=4**: Sets how many processor threads SeqKit can use. Adjust as needed.  
* **Directory Variables**: Defines where the script will look for FASTQ files and save the summary reports, based on the SAMPLEID and a standard output structure (${HOME}/Output\_AS).
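
The header described above could look roughly like the sketch below. The helper `configure_sample`, the `SAMPLE_DIR` variable, and the exact folder layout under `${HOME}/Output_AS` are assumptions made for illustration; the real script may name and arrange these differently.

```bash
#!/bin/bash
set -e -u -o pipefail   # exit on errors, unset variables, and failed pipeline stages

THREADS=4               # threads handed to seqkit via -j

# Hypothetical helper: validate the sample ID and derive the directory variables.
# SAMPLE_DIR and the layout under Output_AS are assumptions, not the script's exact code.
configure_sample() {
    if [ "$#" -lt 1 ] || [ -z "$1" ]; then
        echo "Usage: ./get_all_stats.sh <SAMPLEID>" >&2
        return 1
    fi
    SAMPLEID="$1"
    SAMPLE_DIR="${HOME}/Output_AS/${SAMPLEID}"
    SUMMARY_DIR="${SAMPLE_DIR}/summary_data"
}

# In the script itself this call would be: configure_sample "$@"
configure_sample "my_sample"
echo "Summaries for ${SAMPLEID} will be written to ${SUMMARY_DIR}"
```

Deriving every path from `SAMPLEID` in one place means a different sample only requires a different command-line argument.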

### **Section 1: Create Summary Directory and Check for SeqKit**

* **mkdir \-p "${SUMMARY\_DIR}"**: Creates the summary\_data directory (inside your sample's output folder) if it doesn't already exist. The \-p option creates any missing parent directories and prevents an error if the directory is already there.  
* **if \! command \-v seqkit ...**: Checks if the seqkit tool is installed and can be found by the system. If not, it prints an error and exits.
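
A minimal sketch of these two steps follows. The `require_tool` helper and the temporary placeholder for `SUMMARY_DIR` are assumptions so that the example runs on its own; in the script, `SUMMARY_DIR` comes from the configuration section.

```bash
#!/bin/bash
set -e -u -o pipefail

# Placeholder location so the sketch is self-contained (assumption).
SUMMARY_DIR="$(mktemp -d)/summary_data"

mkdir -p "${SUMMARY_DIR}"
mkdir -p "${SUMMARY_DIR}"   # running it again is harmless because of -p

# Hypothetical helper: succeed only if the named tool can be found in PATH.
require_tool() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "Error: $1 not found in PATH." >&2
        return 1
    fi
}

# The script would stop at this point if seqkit were missing,
# e.g. with: require_tool seqkit || exit 1
if require_tool seqkit; then
    echo "seqkit found at $(command -v seqkit)"
else
    echo "seqkit is missing; the script would exit here"
fi
```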

### **Section 2: Generate Stats for Treatment-Specific FASTQ Files**

This part processes the initial FASTQ files created for each treatment (e.g., AS and Control).

* **TRMT\_FASTQ\_INPUT\_PATTERN**: This defines how to find the FASTQ files. The \* are wildcards:  
  * The first \* matches treatment subfolders (like AS or Control).  
  * The second \* matches the treatment name within the filenames.  
* **File Existence Check**: Before running seqkit, the script uses ls to see if any files actually match the pattern. If not, it skips this step.  
* **seqkit stats ...**: This is the main command.  
  * \-a: Get all statistics.  
  * \-b: Use just the filename (basename) in the report.  
  * \-j `"${THREADS}"`: Use multiple threads.  
  * \-T: Output as a tab-separated table.  
  * `${TRMT_FASTQ_INPUT_PATTERN}`: Provides the list of files (wildcards expanded by the shell).  
  * \-o `"${TRMT_STATS_OUTPUT_FILE}"`: Saves the output to a .tsv file.  
* **Verification**: Checks if seqkit ran successfully and created a non-empty output file, then prints the summary.
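
The check, command, and verification above might be combined as in the following sketch. The helpers `has_matches` and `summarize_fastqs`, the example pattern, and the output filename are hypothetical; only the `seqkit stats` flags are taken from the description above.

```bash
#!/bin/bash
set -e -u -o pipefail

THREADS=4

# Hypothetical helper: succeed only if the glob pattern matches at least one file.
# $1 is left unquoted on purpose so the shell expands the wildcards
# (this assumes the paths contain no spaces).
has_matches() {
    ls $1 >/dev/null 2>&1
}

# Hypothetical helper wrapping the file check, the seqkit call, and the verification.
summarize_fastqs() {
    local pattern="$1" output="$2"
    if ! has_matches "${pattern}"; then
        echo "No files match ${pattern}; skipping." >&2
        return 0
    fi
    seqkit stats -a -b -j "${THREADS}" -T ${pattern} -o "${output}"
    if [ -s "${output}" ]; then      # -s: file exists and is not empty
        cat "${output}"
    else
        echo "Error: ${output} is missing or empty." >&2
        return 1
    fi
}

# Example values; the folder layout and file naming are assumptions.
SAMPLE_DIR="${HOME}/Output_AS/my_sample"
TRMT_FASTQ_INPUT_PATTERN="${SAMPLE_DIR}/*/my_sample.*.fastq.gz"
TRMT_STATS_OUTPUT_FILE="${SAMPLE_DIR}/summary_data/my_sample_treatment_stats.tsv"
summarize_fastqs "${TRMT_FASTQ_INPUT_PATTERN}" "${TRMT_STATS_OUTPUT_FILE}"
```

When nothing matches the pattern, the sketch prints a message and moves on instead of failing, which mirrors the skip behavior described for the file existence check.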

### **Section 3: Generate Stats for Mapped Isolate-Specific FASTQ Files**

This part processes the FASTQ files containing reads that mapped to specific isolates within each treatment (e.g., `...mapped.AS.Bacillus.fastq.gz`).

* **ISO\_FASTQ\_INPUT\_PATTERN**: This pattern is for finding the isolate-specific mapped FASTQ files.  
  * The \*\_\* matches subfolders named like AS\_Bacillus or Control\_Listeria.  
  * The .\*.\*. part matches the treatment and isolate names in the actual filenames.  
* The rest of the logic (file check, seqkit stats command, verification) is similar to Section 2, but applied to these isolate-specific FASTQ files and a different output file.
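
To see how such a pattern selects files, the sketch below builds a throwaway folder tree with empty placeholder files and expands a pattern of the shape described above. The `my_sample.mapped.` filename prefix and the exact pattern are assumptions for illustration, not the script's actual values.

```bash
#!/bin/bash
set -e -u -o pipefail

# Throwaway directory tree with empty placeholder files.
DEMO_DIR="$(mktemp -d)"
mkdir -p "${DEMO_DIR}/AS_Bacillus" "${DEMO_DIR}/Control_Listeria" "${DEMO_DIR}/AS"
touch "${DEMO_DIR}/AS_Bacillus/my_sample.mapped.AS.Bacillus.fastq.gz"
touch "${DEMO_DIR}/Control_Listeria/my_sample.mapped.Control.Listeria.fastq.gz"
touch "${DEMO_DIR}/AS/my_sample.AS.fastq.gz"   # treatment-level file; should not match

# Assumed pattern shape: *_* selects Treatment_Isolate folders,
# and .*.*. covers the treatment and isolate fields in the filename.
ISO_FASTQ_INPUT_PATTERN="${DEMO_DIR}/*_*/*mapped.*.*.fastq.gz"

# Unquoted expansion lets the shell turn the wildcards into a file list.
MATCHED_FILES=( ${ISO_FASTQ_INPUT_PATTERN} )

# Print only the filenames (directory part stripped).
printf '%s\n' "${MATCHED_FILES[@]##*/}"
```

Only the two files inside the `Treatment_Isolate` folders are listed; the treatment-level file in `AS` is ignored because its folder name contains no underscore.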

## **How to Run This Script**

**Run with SAMPLEID:** Execute the script with your sample ID:

`bash dsc_workshop_2025/scripts/get_all_stats.sh your_sample_id_here`

The script then generates the two summary .tsv files in the `summary_data` folder for that sample.

**Note:** To try another sample, replace `jetsonhack_ORANGE_run1_20220207` with the name of another sequencing run in the `dsc-nanopore-datea/as_data` folder (e.g., `DSC_RAD_Enrich`).