# 🧬 Routine WGS QC and Assembly Notebook

This notebook documents the **routine workflow for performing Quality Control (QC)** and **Genome Assembly**  
on Illumina MiSeq Whole Genome Sequencing (WGS) data.  
It is designed to be executed **directly on the server** after the BCL-to-FASTQ conversion step.

---

## 📋 Overview

Two analysis pipelines are used in this workflow:

### 🧪 1. QC Pipeline — [BactScout](https://github.com/ghruproject/bactscout)
- Performs automated quality control, contamination screening, and species verification.  
- Generates comprehensive QC summaries (read quality, coverage, contamination, MLST, etc.).  
- Outputs a consolidated QC report in JSON/TSV format and MultiQC-style HTML summaries.  
- **Manual interpretation is required** — you must decide which samples have passed QC and are suitable for assembly.

### 🧬 2. Assembly Pipeline — [GHRU-Assembly](https://github.com/ghruproject/GHRU-assembly)
- Performs de novo genome assembly on samples that pass QC.  
- Uses the Shovill/SPAdes workflow optimized for bacterial genomes.  
- Produces high-quality assemblies and associated metrics (N50, contig count, genome size, etc.).  
- Organizes output by run and sample name for traceability.
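
Since N50 is one of the main assembly metrics to read in the reports, here is a minimal sketch of how it is computed from a list of contig lengths. The function name `n50` and the example lengths are illustrative only and are not taken from either pipeline.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    # Sort contigs from longest to shortest
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # N50 is the length of the contig at which the running sum
        # first reaches half of the total assembly size
        if running >= half_total:
            return length
    return 0


# Illustrative contig lengths (made up for this example)
example_lengths = [400_000, 300_000, 200_000, 100_000]
print(n50(example_lengths))
```

A higher N50 with a low contig count generally indicates a more contiguous assembly, but it should always be read alongside the total genome size.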

---

## 🧠 Notes

- Results are stored automatically under the run's output directory (`output_dir`):
  - `qc/`
  - `assembly_out/`




## ⚙️ Initial Notebook Setup (Run Only Once)

These steps should be executed **only once**, when setting up this notebook environment on a new server.  
They create a dedicated Python environment for running Jupyter notebooks.

```bash
# Create a new conda/mamba environment with Python 3.10 or higher
# Quote the version spec so the shell does not treat ">" as a redirect
mamba create -n python "python>=3.10" jupyterlab nbconvert pandoc -y

# Activate the environment
mamba activate python

# Install ipykernel into the environment and register it as a Jupyter kernel
pip install -U ipykernel
python -m ipykernel install --user --name python

# After installation, launch Jupyter and select this environment
# as the kernel in the notebook interface.
```

## ⚙️ Step 0: Setup Pipelines and Conda Environments

Before running the QC and Assembly workflows, make sure the required pipelines and conda environments are installed on the server.  
This step only needs to be done **once per system setup**.

---

### 🧩 1. Create and Install Required Conda Environments

Use the following commands to create isolated environments for **BactScout** and **GHRU-Assembly**.

```bash
# Create environment for BactScout (Pixi-based QC)
mamba create -n pixi python=3.10 -y
mamba run -n pixi pip install pixi

# Create environment for GHRU-Assembly (Nextflow-based assembly)
mamba create -n nextflow python=3.10 -y
mamba install -n nextflow -c conda-forge -c bioconda nextflow -y


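# Create directory if not already present
mkdir -p pipelines/
cd pipelines/

# Clone both pipelines. Later cells expect them under /data/nihr/nextflow_pipelines/;
# update bactscout_dir and NEXTFLOW_PIPELINE_DIR if you clone elsewhere.
git clone https://github.com/ghruproject/bactscout.git
git clone https://github.com/ghruproject/GHRU-assembly.git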
```

In [None]:
# %%
# 📁 Step 1: Define input and output directories for the current sequencing run

# The input directory should contain FASTQ files named SAMPLEID_{1/R1}.fastq.gz and SAMPLEID_{2/R2}.fastq.gz
input_dir = "/data/nihr/ghru2/2025/2025-07-03/fastqs"  # 🔧 <-- change this for each run
output_dir = "/data/nihr/ghru2/2025/2025-07-03"   # 🔧 <-- change this for each run

# Define subdirectories used in this notebook
qc_dir = f"{output_dir}/qc"
assembly_dir = f"{output_dir}/assembly_out"

# Create QC and assembly directories if they don't exist
!mkdir -p {qc_dir} {assembly_dir}

print(f"📁 Input directory  : {input_dir}")
print(f"📂 Output directory : {output_dir}")
print(f"🧪 QC output dir   : {qc_dir}")
print(f"🧬 Assembly dir    : {assembly_dir}")
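
Before running QC, it can help to confirm that the FASTQ files follow the naming convention described above. The following is a minimal sketch, assuming the names look like `SAMPLEID_R1.fastq.gz` / `SAMPLEID_R2.fastq.gz` or `SAMPLEID_1.fastq.gz` / `SAMPLEID_2.fastq.gz`. The helper `find_fastq_pairs` and the regular expression are illustrative and are not part of BactScout or GHRU-Assembly.

```python
import re
from pathlib import Path

# Assumed naming convention: SAMPLEID_{1/R1}.fastq.gz and SAMPLEID_{2/R2}.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<read>R?[12])\.fastq\.gz$")


def find_fastq_pairs(fastq_dir):
    """Group FASTQ files by sample ID and list samples missing a mate."""
    reads = {}
    for path in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        match = FASTQ_PATTERN.match(path.name)
        if match is None:
            continue  # file name does not follow the expected convention
        read_number = match["read"][-1]  # "1" or "2"
        reads.setdefault(match["sample"], {})[read_number] = path
    # A sample is complete only when both read 1 and read 2 are present
    paired = {s: r for s, r in reads.items() if set(r) == {"1", "2"}}
    unpaired = sorted(set(reads) - set(paired))
    return paired, unpaired
```

In this notebook it could be called as `paired, unpaired = find_fastq_pairs(input_dir)`; any sample IDs listed in `unpaired` are worth checking before starting the QC step.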


In [None]:
# %%
# 🧪 Step 2: Run BactScout QC pipeline inside the 'pixi' environment
# Path to the cloned BactScout repository
bactscout_dir = "/data/nihr/nextflow_pipelines/bactscout"
# Add additional organisms to the BactScout config file if needed
threads = 12

print(f"🚀 Running BactScout QC...")
print(f"Input FASTQs : {input_dir}")
print(f"Output Dir   : {qc_dir}")
print(f"Threads Used : {threads}\n")

# Run BactScout QC from inside the pipeline directory using 'mamba run'
cmd = f"""
mamba run -n pixi bash -c '
cd "{bactscout_dir}" &&
pixi run bactscout qc -o "{qc_dir}" -t {threads} "{input_dir}"
'
"""

# Execute the command
!{cmd}
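
The `!` shell escape does not stop the notebook when the command fails. As an alternative, here is a minimal sketch that builds the same BactScout call as an argument list and runs it with `subprocess`, so that a non-zero exit code raises an error. The helper names `build_bactscout_command` and `run_checked` are hypothetical. The command-line arguments are the ones already used in the cell above.

```python
import subprocess


def build_bactscout_command(env_name, output_dir, input_dir, threads):
    """Return the argument list for the BactScout QC call used in this notebook."""
    return [
        "mamba", "run", "-n", env_name,
        "pixi", "run", "bactscout", "qc",
        "-o", str(output_dir),
        "-t", str(threads),
        str(input_dir),
    ]


def run_checked(command, cwd=None):
    """Run a command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(command, cwd=cwd, check=True)
```

For example, `run_checked(build_bactscout_command("pixi", qc_dir, input_dir, threads), cwd=bactscout_dir)` would run the pipeline from inside the BactScout directory and halt the cell if it fails.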


## 🧾 Step 3: Manual QC Review and Assign Overall Status

After the BactScout QC pipeline finishes, a file named **`final_summary.csv`** will be generated inside the QC output directory:

```bash
{qc_dir}/final_summary.csv
```


This file contains automated quality flags for each sample, with columns such as:

```csv
sample_id, alt_coverage_status, contamination_status, coverage_status, gc_content_status
```

Each column represents the QC result for a specific check performed by the pipeline.

---

### 🧪 What You Need to Do

1. **Open the file**  
   Navigate to the QC output folder and open `final_summary.csv`.

2. **Add a new column** named **`overall_status`** immediately after the `sample_id` column.

3. For each sample, manually review the QC metrics and assign:
   - `PASSED` → if all individual QC parameters (`alt_coverage_status`, `contamination_status`,  
     `coverage_status`, `gc_content_status`) are acceptable  
     **and** there are no issues observed in the BactScout or MultiQC reports.
   - `FAILED` → if any QC metric indicates poor quality, contamination, or insufficient coverage.

---

### 🧠 Notes

- Use both **automated QC flags** and **visual interpretation** of the reports (e.g., coverage plots, GC content, contamination check) to make your decision.
- Only samples with `overall_status = PASSED` should proceed to the assembly pipeline.
- Save the updated file as **`final_summary.csv`**, overwriting the existing file in the same directory.

This manual review step ensures that only **high-quality and contamination-free samples**  
move forward for genome assembly, improving the overall data quality and reliability.
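
To speed up the review, the `overall_status` column can be pre-filled from the automated flags and then corrected by hand. The sketch below is a starting point only and does not replace the manual review. It assumes that an acceptable flag is written as `PASSED` in each status column, which may not match the values BactScout actually writes, so adjust `pass_value` as needed. The function names are illustrative.

```python
import csv

STATUS_COLUMNS = [
    "alt_coverage_status",
    "contamination_status",
    "coverage_status",
    "gc_content_status",
]


def suggest_overall_status(row, pass_value="PASSED"):
    """Suggest PASSED only if every automated flag equals pass_value."""
    flags = [(row.get(col) or "").strip().upper() for col in STATUS_COLUMNS]
    return "PASSED" if all(flag == pass_value for flag in flags) else "FAILED"


def add_overall_status(in_csv, out_csv, pass_value="PASSED"):
    """Write a copy of the summary with overall_status right after sample_id."""
    with open(in_csv, newline="") as handle:
        # skipinitialspace tolerates "a, b, c" style headers and values
        reader = csv.DictReader(handle, skipinitialspace=True)
        rows = list(reader)
        fields = [f for f in reader.fieldnames if f != "overall_status"]
    fields.insert(fields.index("sample_id") + 1, "overall_status")
    for row in rows:
        row["overall_status"] = suggest_overall_status(row, pass_value)
    with open(out_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```

Writing the suggestions to a separate file first (for example a copy next to `final_summary.csv`) keeps the original intact until the statuses have been checked against the reports.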


In [None]:
# %%
# 🧩 Step 4: Generate samplesheet for PASSED samples using QC results

import os

# Define paths
qc_summary = f"{qc_dir}/final_summary.csv"
script_path = "./scripts/generate_passed_samplesheet.sh"   # path to your script

# The samplesheet is moved into the run's output directory after generation
output_samplesheet = f"{output_dir}/samplesheet_passed.csv"

print(f"🚀 Running generate_passed_samplesheet.sh")
print(f"FASTQ directory : {input_dir}")
print(f"QC summary file : {qc_summary}")
print(f"Script path     : {script_path}")
print(f"Output file     : {output_samplesheet}\n")

# Run the script from within the notebook
!bash {script_path} {input_dir} {qc_summary}

# Move the generated file (if created in the current directory) to the run's output directory
if os.path.exists("samplesheet_passed.csv"):
    !mv samplesheet_passed.csv {output_samplesheet}
    print(f"✅ Moved samplesheet to {output_samplesheet}")
else:
    print("⚠️ samplesheet_passed.csv not found. Please check script output.")
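
The contents of `generate_passed_samplesheet.sh` are not shown in this notebook. As an independent sanity check, here is a minimal sketch that lists the samples marked `PASSED` in the reviewed summary, so the number of rows in the samplesheet can be compared against it. It assumes the `sample_id` and `overall_status` columns described in Step 3. The function name `passed_sample_ids` is illustrative.

```python
import csv


def passed_sample_ids(summary_csv):
    """Return sample IDs whose overall_status is PASSED (case-insensitive)."""
    with open(summary_csv, newline="") as handle:
        reader = csv.DictReader(handle, skipinitialspace=True)
        if "overall_status" not in (reader.fieldnames or []):
            raise ValueError("overall_status column is missing; complete Step 3 first")
        return [
            row["sample_id"]
            for row in reader
            if (row["overall_status"] or "").strip().upper() == "PASSED"
        ]
```

Calling `passed_sample_ids(qc_summary)` in this notebook returns the list of sample IDs expected in `samplesheet_passed.csv`. An empty list means there is nothing to assemble.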


In [None]:
# %%
# 🧬 Step 5: Run GHRU-Assembly pipeline using Nextflow

# Define key paths
NEXTFLOW_PIPELINE_DIR = "/data/nihr/nextflow_pipelines/GHRU-assembly"
samplesheet = f"{output_dir}/samplesheet_passed.csv"
workdir = f"{output_dir}/work"


# Print paths for clarity
print(f"🚀 Running GHRU-Assembly Nextflow pipeline")
print(f"Pipeline dir : {NEXTFLOW_PIPELINE_DIR}")
print(f"Samplesheet  : {samplesheet}")
print(f"Output dir   : {assembly_dir}")
print(f"Work dir     : {workdir}\n")

# Run the Nextflow pipeline inside the 'nextflow' mamba environment
!mamba run -n nextflow bash -c "nextflow run {NEXTFLOW_PIPELINE_DIR}/main.nf \
    --samplesheet {samplesheet} \
    --outdir {assembly_dir} \
    -w {workdir} \
    -resume"

In [None]:
# %%
# 🧹 Step 6: Clean up the Nextflow work directory and log files

# Run this only after confirming the assembly outputs are complete: deleting work/ means -resume can no longer reuse cached tasks
!rm -rf {output_dir}/work && rm -f {output_dir}/.nextflow*.log

print(f"🧹 Cleanup complete — removed 'work/' and any '.nextflow*.log' files from {output_dir}")
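
If `output_dir` is ever empty or mistyped, the `rm -rf` above could act on an unintended path. A more defensive variant in Python is sketched below. It removes the same targets (`work/` and `.nextflow*.log` inside the run directory) but refuses to run when the directory does not exist. The function name `clean_nextflow_files` is illustrative.

```python
import shutil
from pathlib import Path


def clean_nextflow_files(run_dir):
    """Delete run_dir/work and run_dir/.nextflow*.log; return what was removed."""
    run_dir = Path(run_dir).resolve()
    if not run_dir.is_dir():
        raise NotADirectoryError(f"Run directory not found: {run_dir}")
    removed = []
    work = run_dir / "work"
    if work.is_dir():
        shutil.rmtree(work)  # removes the Nextflow work directory and its contents
        removed.append(work)
    for log in run_dir.glob(".nextflow*.log"):
        log.unlink()
        removed.append(log)
    return removed
```

In this notebook it could be called as `clean_nextflow_files(output_dir)`. The returned list shows exactly which paths were deleted.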
