
# Long‑Read Basecalling on the OSPool (ONT → Dorado, GPU)

**Goal:** Basecall Oxford Nanopore **POD5** files into **BAM/FASTQ** using **Dorado (GPU)** on the **OSPool**, organized as a scalable, HTCondor‑driven workflow.  
**Data locality:** Input and outputs are staged via the **Open Science Data Federation (OSDF)**.  
**Compute:** GPU‑enabled OSPool execution points.

> This notebook mirrors the structure and style of the Minimap2 tutorial, but covers **only** the *basecalling* stage.



## Learning outcomes

By the end, you can:
- Organize ONT inputs (POD5) and model archives for **Dorado** on the OSPool.
- Create an **HTCondor** submit file for GPU basecalling at scale.
- Launch, monitor, and debug jobs.
- Collect basecalled outputs from OSDF.



## Prerequisites

- OSPool account/access (OSG/CHTC).
- Basic command‑line skills (Bash, file paths).
- HTCondor basics (submit files, logs).
- Your ONT raw data in **POD5** format and a Dorado model tarball (e.g., `dna_r10.4.1_e8.2_400bps_hac@v5.2.0.tar.gz`).
- A working Dorado container or the tutorial container environment.


## Understanding Basecalling in Oxford Nanopore Sequencing

In Oxford Nanopore Technologies (ONT) sequencing, the instrument does not directly read DNA or RNA bases. Instead, it measures subtle changes in ionic current as a single-stranded molecule passes through a nanopore embedded in a membrane. These electrical current traces—often called raw signals or squiggles—reflect the sequence-dependent resistance patterns of the nucleotides inside the pore.

### From Signal to Sequence: The Role of Basecalling

The process of converting these continuous electrical signals into nucleotide sequences (A, C, G, T or U) is called basecalling. Basecalling algorithms interpret the temporal and amplitude patterns in the current signal to infer the most probable underlying sequence of bases. This is one of the most computationally demanding and algorithmically sophisticated parts of the ONT analysis pipeline.

Dorado uses deep neural network models trained to map signal patterns to the most probable base sequence. These models are computationally intensive and rely on GPU acceleration: GPUs parallelize the matrix arithmetic behind neural network inference, which shortens basecalling time substantially and makes accurate, high-throughput decoding of long-read data practical.

Because each POD5 file can be basecalled independently, Dorado workflows parallelize naturally on the OSPool, where many GPU-enabled jobs can run at the same time. The OSPool's distributed architecture provides the compute, memory, and data-staging infrastructure (via OSDF) needed to handle large sequencing runs reproducibly, so a run that would occupy a single local GPU for hours can be spread across many execution points.

## Basecalling on the OSPool by Sequencing Channel

When performing simplex basecalling with Dorado, it’s often advantageous to reorganize your raw data by sequencing channel before submitting jobs to the OSPool. Each Oxford Nanopore flow cell consists of hundreds to thousands of channels that generate independent signal traces, meaning the data within a single POD5 file can be subdivided into smaller, channel-specific subsets.

By splitting the data so that each channel’s reads reside in their own POD5 file, we enable truly parallel basecalling: each channel file becomes an independent job that can run simultaneously across hundreds or thousands of OSPool execution points. This design perfectly aligns with the principles of High Throughput Computing (HTC)—many small, independent jobs working together to accelerate large workflows.

We use the `pod5` package available inside the `dorado.sif` container to generate per-channel subsets and organize them into a `split_pod5_subsets` directory. Each resulting POD5 file can then be basecalled as its own job, which shortens time-to-results while keeping the workflow reproducible and scalable on the OSPool.
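
The per-channel split described above could be sketched as follows. This is a minimal sketch, assuming the `pod5` command-line tool from the container is on `PATH`. The `inputs` source directory and the `summary.tsv` file name are placeholders chosen for illustration, and older `pod5` releases spell the `--table` option as `--summary`.

```shell
set -eu

# Assumed locations: adjust to your own layout.
POD5_DIR="${POD5_DIR:-inputs}"                # directory holding the original POD5 files
SUMMARY="${SUMMARY:-summary.tsv}"             # read_id/channel table (hypothetical name)
SPLIT_DIR="${SPLIT_DIR:-split_pod5_subsets}"  # per-channel POD5 files are written here

split_by_channel() {
  # 1) Record which channel produced each read.
  pod5 view "${POD5_DIR}" --include "read_id, channel" --output "${SUMMARY}"
  # 2) Write one POD5 file per distinct value of the channel column.
  pod5 subset "${POD5_DIR}" --table "${SUMMARY}" --columns channel --output "${SPLIT_DIR}"
}

# Only run when the tool and the input directory are both present,
# for example inside the Dorado container.
if command -v pod5 >/dev/null 2>&1 && [ -d "${POD5_DIR}" ]; then
  split_by_channel
else
  echo "pod5 or ${POD5_DIR} not found; run this inside the Dorado container" >&2
fi
```

Splitting once up front keeps the per-job work uniform: every job then receives a single, small POD5 file.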


## Directory structure (recommended)

We assume a layout similar to this (directories may already exist if you used the companion repository):

```
ont-basecalling/
├── executables/               # helper scripts
├── inputs/                    # POD5 inputs (or OSDF paths to them)
├── logs/                      # condor .log/.out/.err
├── outputs/                   # basecalled outputs (BAM/FASTQ)
├── software/                  # containers, model tarballs, helper assets
├── list_of_pod5_files.txt     # one POD5 path per line
├── run_dorado.sub             # HTCondor submit file (GPU)
└── tutorial-setup.sh          # optional helper to set up the tree
```

You will also have a companion **OSDF** directory for storing large files, such as containers and Dorado models. The directory structure should look like:

```
/ospool/<ap##>/data/<your-username>/tutorial-ONT-Basecalling/
├── data/                      # Dorado model tarballs
│   ├── dna_r10.4.1_e8.2_400bps_fast@v4.2.0_5mCG_5hmCG@v2.tar.gz
│   ├── dna_r10.4.1_e8.2_400bps_fast@v4.2.0.tar.gz
│   ├── ...
│   └── rna004_130bps_sup@v5.2.0.tar.gz
└── software/                  # Dorado container (or osdf:/// path)
    └── dorado_build1.2.0_27OCT2025_v1.sif
```

We've included a `tutorial-setup.sh` script in the companion repository to create this structure for you. To run it, execute:

In [None]:
%%bash
./tutorial-setup.sh


## OSDF data paths

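For large datasets, read inputs from **OSDF** and write outputs back to OSDF. The same location is addressed in two ways:

- Path on the access point:
  `/ospool/<ap##>/data/<your-username>/tutorial-ONT-Basecalling/`
- URL in a submit file:
  `osdf:///ospool/<ap##>/data/<your-username>/tutorial-ONT-Basecalling/`

You can also stage a **model** tarball on OSDF and fetch it at job start.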



## Prepare the list of POD5 files

List one absolute OSDF path per line in `list_of_pod5_files.txt`. This enables scalable, itemized submission.


In [None]:
%%bash
# Create a starter list file (edit paths to your inputs).
# You can paste more lines or generate this programmatically.
# Keep the file to one POD5 path per line: every line becomes one job.
cat > list_of_pod5_files.txt << 'EOF'
/ospool/uc-shared/public/example-ont/pod5/sample01/channel-100.pod5
/ospool/uc-shared/public/example-ont/pod5/sample01/channel-101.pod5
EOF

echo "Wrote $(wc -l < list_of_pod5_files.txt) entries to list_of_pod5_files.txt"
head -n 3 list_of_pod5_files.txt
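
For runs with many per-channel files, the list can be generated rather than typed. The sketch below is one possible approach: `make_pod5_list` is a hypothetical helper (not part of the companion repository), and the demo runs against a throwaway directory so it can be tried anywhere.

```shell
set -eu

# make_pod5_list DIR OUT: write one absolute .pod5 path per line, sorted.
make_pod5_list() {
  src_dir=$(cd "$1" && pwd)   # resolve DIR to an absolute path
  find "${src_dir}" -type f -name '*.pod5' | sort > "$2"
}

# Self-contained demo on a temporary directory (file names are illustrative).
demo_dir=$(mktemp -d)
mkdir -p "${demo_dir}/sample01"
touch "${demo_dir}/sample01/channel-100.pod5" \
      "${demo_dir}/sample01/channel-101.pod5" \
      "${demo_dir}/sample01/notes.txt"

make_pod5_list "${demo_dir}" "${demo_dir}/list_of_pod5_files.txt"
cat "${demo_dir}/list_of_pod5_files.txt"
```

On an access point, the same helper would be pointed at your OSDF directory, for example `make_pod5_list /ospool/<ap##>/data/<your-username>/tutorial-ONT-Basecalling list_of_pod5_files.txt`.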


## Create an executable wrapper for Dorado

This wrapper untars the Dorado model **per job**, runs Dorado basecalling on one POD5 file, and emits a BAM (and optionally FASTQ).  
Adapt the `DORADO_ARGS` and `DORADO_MODEL_TARBALL` values to your environment.


In [None]:
%%bash
mkdir -p executables outputs logs software

cat > executables/run_dorado.sh << 'EOS'
#!/usr/bin/env bash
set -euo pipefail

# Positional args
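DORADO_ARGS="$1"           # e.g., 'basecaller --device cuda:all --batchsize 16 dna_r10.4.1_e8.2_400bps_hac@v5.2.0'
DORADO_MODEL_TARBALL="$2"  # e.g., 'dna_r10.4.1_e8.2_400bps_hac@v5.2.0.tar.gz'
INPUT_POD5="$3"            # one .pod5 file (from list_of_pod5_files.txt)
# Transferred inputs land at the top of the job sandbox; fall back to the
# bare file name if the path as given does not exist on the execution point.
[[ -f "${INPUT_POD5}" ]] || INPUT_POD5="${INPUT_POD5##*/}"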

# Derive filenames
BAM_FILE="${INPUT_POD5##*/}.bam"   # channel-100.pod5.bam
FASTQ_FILE="${BAM_FILE}.fastq"     # optional

echo "[INFO] Host: $(hostname)"
echo "[INFO] GPU(s): ${CUDA_VISIBLE_DEVICES:-none}"
echo "[INFO] Input POD5: ${INPUT_POD5}"
echo "[INFO] Dorado args: ${DORADO_ARGS}"
echo "[INFO] Model tarball: ${DORADO_MODEL_TARBALL}"

# Model extraction (assumes tarball in the job sandbox)
tar -xzf "${DORADO_MODEL_TARBALL}"
rm -f "${DORADO_MODEL_TARBALL}"
echo "[INFO] Model extracted."

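# Run Dorado (writes BAM to stdout)
# Example DORADO_ARGS, with the model given as the extracted directory
# (assuming the tarball unpacks to a directory named after it):
#   "basecaller --device cuda:all --batchsize 16 dna_r10.4.1_e8.2_400bps_hac@v5.2.0"
# DORADO_ARGS is left unquoted on purpose so it splits into separate words.
set -x
dorado ${DORADO_ARGS} "${INPUT_POD5}" > "${BAM_FILE}"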
set +x
echo "[INFO] Dorado basecalling complete: ${BAM_FILE}"

# Optional: convert BAM -> FASTQ
# samtools fastq "${BAM_FILE}" > "${FASTQ_FILE}"
# echo "[INFO] Generated FASTQ: ${FASTQ_FILE}"

# Stage outputs (HTCondor returns this directory only if the submit file lists it in transfer_output_files)
mkdir -p outputs
mv -v "${BAM_FILE}" outputs/
# mv -v "${FASTQ_FILE}" outputs/  # if enabled
EOS

chmod +x executables/run_dorado.sh
echo "Created executables/run_dorado.sh"
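
The file names in the wrapper come from Bash parameter expansion. The short sketch below shows the two expansions in isolation, using the example paths from this notebook. The `model_dir` line assumes the tarball unpacks to a directory named after it, which you should confirm for your own archive.

```shell
set -eu

input_pod5="/ospool/uc-shared/public/example-ont/pod5/sample01/channel-100.pod5"
model_tarball="dna_r10.4.1_e8.2_400bps_hac@v5.2.0.tar.gz"

pod5_name="${input_pod5##*/}"         # drop the longest prefix ending in "/"
bam_file="${pod5_name}.bam"           # same naming rule as the wrapper
model_dir="${model_tarball%.tar.gz}"  # drop the ".tar.gz" suffix (assumed directory name)

echo "POD5 name : ${pod5_name}"
echo "BAM file  : ${bam_file}"
echo "Model dir : ${model_dir}"
```

Because the BAM name is derived from the POD5 file name alone, two inputs with the same file name in different directories would produce the same output name.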


## HTCondor submit file (GPU)

This submit file requests a GPU slot, transfers in the model tarball and executable, and runs one POD5 per job.


In [None]:
%%bash
cat > run_dorado.sub << 'EOS'
# Dorado basecalling on OSPool (GPU)
universe                = vanilla
executable              = executables/run_dorado.sh

# Arguments: '<DORADO_ARGS>' <MODEL_TARBALL> <INPUT_POD5>
# The whole string is double-quoted; single quotes group DORADO_ARGS into one argument.
# The model is passed as the extracted directory, assuming the tarball unpacks
# to a directory of the same name. $Fnx(ITEM) is the POD5 file name without its path.
arguments               = "'basecaller --device cuda:all --batchsize 16 dna_r10.4.1_e8.2_400bps_hac@v5.2.0' dna_r10.4.1_e8.2_400bps_hac@v5.2.0.tar.gz $Fnx(ITEM)"

# Use an Apptainer/Singularity container for Dorado (HTCondor transfers the image)
# Adjust to your OSDF image path if preferred (e.g., osdf:///ospool/.../dorado.sif).
container_image         = software/dorado.sif

# Files to transfer in with each job (executable is sent automatically)
# osdf://$(ITEM) expands to osdf:///ospool/... for the paths in the list file.
transfer_input_files    = software/dna_r10.4.1_e8.2_400bps_hac@v5.2.0.tar.gz, osdf://$(ITEM)

# Return the outputs/ directory created by the wrapper
transfer_output_files   = outputs

# GPU request and resource hints
request_gpus            = 1
request_cpus            = 2
request_memory          = 8 GB
request_disk            = 5 GB

# Minimum CUDA compute capability (adapt to your model and pool)
gpus_minimum_capability = 7.0

# Logs
log                     = logs/$(Cluster).log
output                  = logs/$(Cluster).$(Process).out
error                   = logs/$(Cluster).$(Process).err

# File transfer behavior
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# Optional: remove jobs that are held and were submitted more than a day ago
# periodic_remove = (JobStatus == 5) && (time() - QDate > 86400)

# Input list (one pod5 per line); queue comes last so every setting above applies
queue ITEM from list_of_pod5_files.txt
EOS

echo "Created run_dorado.sub"


### Place container and model on OSDF (recommended)

For large classes, store the **Dorado container** and **model tarball** on OSDF and reference them via `osdf:///` URLs in `transfer_input_files`.  
In this notebook we referenced local `software/dorado.sif` and a local model tarball for simplicity.
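
Switching to OSDF-hosted files mostly means rewriting paths as URLs. The sketch below shows that mapping with a hypothetical helper, `to_osdf_url`. The access point `ap40`, the user name `alice`, and the presence of the `hac@v5.2.0` tarball under `data/` are placeholders for illustration.

```shell
set -eu

# to_osdf_url PATH: prefix an absolute /ospool/... path with the osdf:// scheme.
to_osdf_url() {
  case "$1" in
    /ospool/*) printf 'osdf://%s\n' "$1" ;;
    *) echo "not an /ospool path: $1" >&2; return 1 ;;
  esac
}

# Placeholder access point and user name, for illustration only.
base="/ospool/ap40/data/alice/tutorial-ONT-Basecalling"
container_url=$(to_osdf_url "${base}/software/dorado_build1.2.0_27OCT2025_v1.sif")
model_url=$(to_osdf_url "${base}/data/dna_r10.4.1_e8.2_400bps_hac@v5.2.0.tar.gz")

# Print the values that would be pasted into the submit file.
echo "container : ${container_url}"
echo "model     : ${model_url}"
```

The resulting URLs take the place of the local `software/...` paths in the submit file.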



## Submit jobs


In [None]:

%%bash
# Preview: print the first few queue items for sanity (nothing is submitted here)
echo "Preview first 2 jobs:"
head -n 2 list_of_pod5_files.txt | while read -r p; do
  echo "job with ITEM=$p"
done

# Actual submission (uncomment to run on your site)
# condor_submit run_dorado.sub
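
Before submitting, it can help to check the list for entries that would overwrite each other, since output BAMs are named after the POD5 file name only. The following is a small sketch of such a check. `duplicate_basenames` is a hypothetical helper, and the demo list contains a deliberate collision.

```shell
set -eu

# duplicate_basenames LIST: print file names that occur more than once in LIST.
# Blank lines and lines starting with '#' are ignored.
duplicate_basenames() {
  grep -v -e '^[[:space:]]*$' -e '^#' "$1" | sed 's#.*/##' | sort | uniq -d
}

# Demo list with a deliberate collision (paths are illustrative).
demo_list=$(mktemp)
cat > "${demo_list}" << 'EOF'
/data/sample01/channel-100.pod5
/data/sample01/channel-101.pod5
/data/sample02/channel-100.pod5
EOF

dups=$(duplicate_basenames "${demo_list}")
if [ -n "${dups}" ]; then
  echo "Output names would collide for: ${dups}"
fi
```

For a closer look at what HTCondor would queue, `condor_submit -dry-run jobs.ads run_dorado.sub` writes the job ClassAds to `jobs.ads` (an arbitrary file name) without submitting anything.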



## Monitor jobs


In [None]:

%%bash
# Common monitoring commands
echo "Queue summary:"
condor_q -autoformat ClusterId ProcId JobStatus RequestGPUs | head -n 20 || true

echo
echo "GPU slots overview (sample):"
condor_status -constraint 'TotalGPUs > 0' -af Name GPUs_DeviceName GPUs_Capability TotalGPUs GPUs | head -n 20 || true
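
The `JobStatus` column printed above is a numeric code. A small sketch like the one below can turn a stream of codes into a readable tally. The function names are hypothetical, and the demo uses canned codes rather than a live queue.

```shell
set -eu

# status_name CODE: translate an HTCondor JobStatus integer into its name.
status_name() {
  case "$1" in
    1) echo Idle ;;
    2) echo Running ;;
    3) echo Removed ;;
    4) echo Completed ;;
    5) echo Held ;;
    6) echo TransferringOutput ;;
    7) echo Suspended ;;
    *) echo Unknown ;;
  esac
}

# tally_status: read one JobStatus code per line on stdin, print NAME=COUNT lines.
tally_status() {
  while read -r code; do
    status_name "${code}"
  done | sort | uniq -c | awk '{print $2 "=" $1}'
}

# Demo on canned codes; with a live queue, pipe in `condor_q -af JobStatus`.
summary=$(printf '%s\n' 1 2 2 5 2 | tally_status | tr '\n' ' ')
echo "${summary}"
```

A growing `Held` count is usually the first sign that something in the submit file or wrapper needs attention.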



## Inspect logs and failures

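Look at `.out/.err` for Dorado error messages (e.g., model alias mismatch, no visible GPUs, CUDA driver mismatch). The wrapper's `set -x` trace is also written to `.err`, so a non-empty `.err` file does not by itself indicate a failure.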


In [None]:

%%bash
# Tail a recent job's output/error (replace IDs accordingly)
# tail -n 100 logs/<CLUSTER>.<PROC>.out
# tail -n 100 logs/<CLUSTER>.<PROC>.err
echo "Replace <CLUSTER> and <PROC> with your IDs to inspect logs."
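
With many jobs, scanning every log by hand is impractical. One possible shortcut, sketched below, is to list only the `.err` files that mention an error. The helper name and the sample log lines are made up for the demo, and the search word is an assumption about how failures are reported.

```shell
set -eu

# err_files_with_errors DIR: list .err files in DIR that mention "error" (any case).
err_files_with_errors() {
  grep -l -i 'error' "$1"/*.err 2>/dev/null || true
}

# Demo logs: one clean trace and one failure (contents are made up).
demo_logs=$(mktemp -d)
echo "+ dorado basecaller --device cuda:all" > "${demo_logs}/1234.0.err"
echo "[error] no visible CUDA devices" > "${demo_logs}/1234.1.err"

failed=$(err_files_with_errors "${demo_logs}")
echo "Logs mentioning errors: ${failed}"
```

Pointing the helper at `logs` gives a short list of files worth opening with `tail`.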



## Collect outputs

Successful jobs write BAMs into `outputs/`. You can stage these back to OSDF or keep them locally for downstream steps.


In [None]:

%%bash
mkdir -p outputs
ls -lh outputs || true
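
To confirm that every input produced an output, the list can be compared against `outputs/`. The sketch below assumes the wrapper's naming rule (`<pod5 name>.bam`). `missing_outputs` and `rerun_list.txt` are hypothetical names, and the demo uses a placeholder file in place of a real BAM.

```shell
set -eu

# missing_outputs LIST OUTDIR: print list entries whose <name>.bam is absent or empty.
missing_outputs() {
  while read -r pod5; do
    case "${pod5}" in ''|'#'*) continue ;; esac   # skip blank lines and comments
    bam="$2/${pod5##*/}.bam"
    [ -s "${bam}" ] || echo "${pod5}"
  done < "$1"
}

# Demo: two inputs, only one of which has an output.
demo=$(mktemp -d)
mkdir -p "${demo}/outputs"
printf '%s\n' /data/sample01/channel-100.pod5 /data/sample01/channel-101.pod5 > "${demo}/list.txt"
echo "placeholder" > "${demo}/outputs/channel-100.pod5.bam"   # stands in for a real BAM

missing_outputs "${demo}/list.txt" "${demo}/outputs" > "${demo}/rerun_list.txt"
cat "${demo}/rerun_list.txt"
```

The resulting file has the same format as `list_of_pod5_files.txt`, so it can be used in a `queue ITEM from` statement to resubmit only the missing items.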



## Troubleshooting

- **`Invalid CUDA device index` / `0 visible CUDA devices`**: Ensure GPU slots are selected and container sees GPUs. Check `CUDACapability` and `nvidia-smi` availability.
- **Model mismatch**: Confirm that the model alias (`hac@...` / `sup@...`) or directory name passed to Dorado matches the model files within your tarball.
- **Large model tarball transfer**: Host the model on **OSDF** to avoid per‑job upload from submit host.
- **Throughput**: Tune `--batchsize` and `request_cpus`, and make sure the GPU has enough memory for the chosen model. Split work at the POD5-file level so that each job processes one file.



## Cleanup (optional)

Remove temporary logs or partial outputs if you are iterating frequently.


In [None]:

%%bash
# rm -rf logs/* outputs/*  # uncomment with caution
echo "Nothing deleted. Uncomment the rm lines to clean."



## Next steps

- Proceed to **read mapping** (Minimap2) using your basecalled BAM/FASTQ files.
- Optionally integrate structural variant calling (Sniffles2) as a new stage.
- Convert this workflow into a DAGMan pipeline for multi‑stage orchestration.
