<h2>Workflow for processing raw nanopore sequencing output</h2>

<h3>Step 1 – Basecalling on raw pod5 files (if basecalling was not performed in real time)</h3>

<p>
<em>
Requires Dorado downloaded and installed; see
<a href="https://github.com/nanoporetech/dorado">https://github.com/nanoporetech/dorado</a>
for instructions and help
</em>
</p>
<p>
This step basecalls your raw nanopore signal data (stored in pod5 files) and, if you pass --reference, aligns the reads to a reference genome. The output is a BAM file. If you want FASTQ output instead, add --emit-fastq as an argument.
</p>

<p>When selecting your basecalling model consider:</p>

<ul>
  <li><strong>Speed / accuracy types:</strong> fast | hac | sup</li>
  <li><strong>Type of bases:</strong> canonical | modified</li>
</ul>

In [None]:
# look up basecalling models if you need to
! dorado --list-models

In [None]:
# basecalling
! dorado basecaller [basecalling model id] [path to pod5] --reference [path to reference genome] > output.bam
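
<p>
The two model choices above (speed tier and base type) end up in a single model argument. The sketch below only builds the argument list; it does not run dorado. It assumes the comma-separated model syntax shown in the dorado README (for example <code>hac,5mCG_5hmCG</code>). The helper name, the file paths, and the choice of modification are hypothetical, so check <code>dorado --list-models</code> for what is available in your installed version.
</p>

```python
import shlex

SPEED_TIERS = {"fast", "hac", "sup"}


def dorado_basecaller_args(speed, pod5_path, reference=None, modifications=()):
    """Build (but do not run) an argument list for `dorado basecaller`."""
    if speed not in SPEED_TIERS:
        raise ValueError(f"speed must be one of {sorted(SPEED_TIERS)}, got {speed!r}")
    # Assumption: modified-base models are appended to the speed tier with commas
    model = ",".join([speed, *modifications])
    args = ["dorado", "basecaller", model, pod5_path]
    if reference is not None:
        # Adding a reference makes dorado align the reads while basecalling
        args += ["--reference", reference]
    return args


# Hypothetical paths and modification name, for illustration only
args = dorado_basecaller_args(
    "sup", "pod5_dir/", reference="reference.fasta", modifications=["5mCG_5hmCG"]
)
# Print the command so it can be inspected before running it in a shell
print(shlex.join(args) + " > output.bam")
```

<p>
Building the list first makes it easy to review the exact command, or to pass it to <code>subprocess.run</code> with <code>stdout</code> pointed at an open output file.
</p>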

<h3>Step 2 – Sort and index BAM file, get coverage statistics</h3>
<p>
<em>
Requires samtools; see:
<a href="https://www.htslib.org/doc/">https://www.htslib.org/doc/</a>
for instructions and help
</em>
</p>

In [None]:
# sort
! samtools sort -o [output.sorted.bam] [output.bam]

In [None]:
# index
! samtools index [output.sorted.bam]

In [None]:
# coverage statistics
! samtools coverage [output.sorted.bam]
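
<p>
<code>samtools coverage</code> prints one tab-separated row per reference sequence, with a header line starting with <code>#rname</code>. A minimal sketch of flagging poorly covered contigs from that table is below. The example table, the contig names, and both thresholds are made up for illustration, not recommendations.
</p>

```python
import csv
import io

# Made-up example of `samtools coverage` output (tab-separated); values are illustrative only
COVERAGE_EXAMPLE = (
    "#rname\tstartpos\tendpos\tnumreads\tcovbases\tcoverage\tmeandepth\tmeanbaseq\tmeanmapq\n"
    "chromosome\t1\t900000\t52000\t899100\t99.9\t210.4\t28.1\t59.8\n"
    "lp25\t1\t24000\t310\t18000\t75.0\t12.3\t27.5\t57.2\n"
)


def low_coverage_contigs(text, min_breadth=95.0, min_depth=20.0):
    """Return contig names whose breadth (%) or mean depth is below the thresholds."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    flagged = []
    for row in reader:
        breadth = float(row["coverage"])   # percent of bases covered
        depth = float(row["meandepth"])    # mean read depth
        if breadth < min_breadth or depth < min_depth:
            flagged.append(row["#rname"])
    return flagged


print(low_coverage_contigs(COVERAGE_EXAMPLE))
```

<p>
To use it on real output, save the table with <code>samtools coverage [output.sorted.bam] > coverage.tsv</code> and pass the contents of that file to the function.
</p>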

<h3>Step 3 – Get counts for modified base positions (if you used modified basecalling)</h3>

<p>
<em>
Requires Modkit installed; see
<a href="https://github.com/nanoporetech/modkit">https://github.com/nanoporetech/modkit</a>
for instructions and help
</em>
</p>
<p>
This aggregates the per-read modified base calls and counts the number of modified and unmodified calls at each position. The output is a bedMethyl file (a BED-style table) with per-position methylation summaries.
</p>

In [None]:
# modified base data
! modkit pileup [output.sorted.bam] [output.bedmethyl]
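
<p>
A minimal sketch of reading the bedMethyl table is below. It assumes the column layout described in the modkit documentation, where column 10 is the valid coverage and column 11 is the percent modified. The example rows and the thresholds are made up for illustration; confirm the column order against the documentation for your modkit version.
</p>

```python
# Made-up bedMethyl rows for illustration; assumed columns: chrom, start, end, mod code,
# score, strand, start, end, color, Nvalid_cov, percent modified, Nmod, Ncanonical,
# Nother_mod, Ndelete, Nfail, Ndiff, Nnocall
BEDMETHYL_EXAMPLE = (
    "contig1\t100\t101\tm\t20\t+\t100\t101\t255,0,0\t20\t90.00\t18\t2\t0\t0\t1\t0\t0\n"
    "contig1\t250\t251\tm\t4\t-\t250\t251\t255,0,0\t4\t25.00\t1\t3\t0\t0\t0\t0\t0\n"
    "contig1\t300\t301\tm\t15\t+\t300\t301\t255,0,0\t15\t20.00\t3\t12\t0\t0\t0\t0\t0\n"
)


def high_confidence_sites(text, min_cov=10, min_percent=80.0):
    """Return (chrom, start, strand, percent) for well-covered, mostly modified sites."""
    sites = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # split() on any whitespace copes with tab- or space-separated count columns
        fields = line.split()
        chrom, start, strand = fields[0], int(fields[1]), fields[5]
        valid_cov = int(fields[9])
        percent_modified = float(fields[10])
        if valid_cov >= min_cov and percent_modified >= min_percent:
            sites.append((chrom, start, strand, percent_modified))
    return sites


print(high_confidence_sites(BEDMETHYL_EXAMPLE))
```

<p>
Filtering on coverage first avoids treating a position with only a handful of reads as confidently methylated.
</p>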

<h2>Other useful commands</h2>

<h3>Aligning reads that have already been basecalled</h3>
<p>
Use this for unaligned FASTQ or BAM files. Dorado uses minimap2 to align the reads to the reference genome.
</p>

In [None]:
# align reads to reference
! dorado aligner [reference genome] [reads.fastq] > aligned.bam 

<h3>Demultiplexing pooled barcode reads</h3>
<p>
If you did not basecall in real time during sequencing, you must basecall before demultiplexing. After basecalling (see Step 1), use this command to split your basecalled reads by sample barcode. The output is one BAM file per barcode, written to the directory you specify (the directory is created if it does not already exist).
</p>

In [None]:
# demultiplexing reads
! dorado demux [reads.fastq] --kit-name [sequencing kit code] --output-dir [/demux_reads]

<h3>Grab list of read id's for a specific barcode</h3>
<p>
Do this if you need to subset reads from a pod5 that contains all barcoded samples. This grabs the read IDs from a BAM file containing the reads for a specific barcode.
</p>

In [None]:
# grab read id by barcode
! samtools view [barcode07.bam] | cut -f1 | sort -u > read_ids.txt
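
<p>
The <code>sort -u</code> step matters because one read can appear on several alignment lines (for example, a primary plus a supplementary alignment). The sketch below mirrors the shell pipeline in Python on a few made-up SAM lines so the effect is visible; the read names and alignments are invented for illustration.
</p>

```python
# Made-up SAM records; read-b appears twice (primary alignment plus a supplementary one, flag 2048)
SAM_EXAMPLE = "\n".join([
    "read-b\t0\tcontig1\t10\t60\t5M\t*\t0\t0\tACGTA\t*",
    "read-a\t0\tcontig1\t40\t60\t5M\t*\t0\t0\tGGCTA\t*",
    "read-b\t2048\tcontig1\t90\t60\t5M\t*\t0\t0\tTTGCA\t*",
])


def unique_read_ids(sam_text):
    """Equivalent of `cut -f1 | sort -u`: first column, de-duplicated and sorted."""
    ids = set()
    for line in sam_text.splitlines():
        # Skip blank lines and SAM header lines (which start with '@')
        if not line or line.startswith("@"):
            continue
        ids.add(line.split("\t", 1)[0])
    return sorted(ids)


read_ids = unique_read_ids(SAM_EXAMPLE)
print("\n".join(read_ids))
```

<p>
Three alignment lines collapse to two read IDs here, which is the one-per-line list that the pod5 subsetting step expects.
</p>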

<h3>Subset a pod5 file to reads for a specific barcode</h3>
<p>
<em>
Requires pod5 package installed; see
<a href="https://software-docs.nanoporetech.com/pod5/latest/tools/">https://software-docs.nanoporetech.com/pod5/latest/tools/</a>
for instructions and help
</em>
</p>
<p>
Do this to generate a new pod5 file containing only the reads for a specific barcoded sample. This is helpful if basecalling needs to be done separately for that sample. You will need your list of read IDs, one per line (see the previous command). The output file name must end in the .pod5 extension for dorado to recognize it later. Use the --missing-ok argument to indicate that it is OK if not every original pod5 file contains read IDs for your barcode; otherwise the command will fail. --output can also point into a new directory if you want to keep the subset separate (usually a good idea, since you point dorado at directories of pod5 files to execute commands).
</p>

In [None]:
# subset pod5
! pod5 filter [pod5_directory/] --ids [read_ids.txt] --output [new_pod5_directory/subset.pod5] --missing-ok

<h3>Generate a consensus sequence for a targeted region of genome from reads that are aligned to a reference genome</h3>
<p>
This workflow shows an example for generating a consensus sequence for an RM gene in the B331 strain found on lp25 with the coordinates 423-4238. The fasta id for lp25 on the reference genome is B331_CP017212_116_lp25. 
</p>
<p>
Step 1 - Subset the BAM file to only the reads in the region of interest and generate a region-specific reference sequence
</p>
<ul>
  <li>Exact id from reference fasta needs to be used, followed by ":start-end"</li>
  <li>subset > sort > index > make region specific ref seq > check number of aligned reads</li>
</ul>
<p>
Step 2 - Call variants and generate the consensus sequence
</p>
<ul>
   <li>variant call > zip vcf file > index vcf file > generate consensus sequence</li>
   <li>*Note: -Ov specifies plain (uncompressed) VCF output. I think the newest version of bcftools can take uncompressed VCF as input for consensus, but I ended up needing to compress and index the file first</li>
</ul>
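
<p>
The region string is easy to get wrong, so a small sanity check can help before running the commands. The sketch below splits a <code>contig:start-end</code> string and reports the region length, relying on samtools region coordinates being 1-based and inclusive. The helper name is hypothetical.
</p>

```python
import re


def parse_region(region):
    """Split a samtools-style 'contig:start-end' string into (contig, start, end)."""
    match = re.fullmatch(r"(.+):(\d+)-(\d+)", region)
    if match is None:
        raise ValueError(f"expected 'contig:start-end', got {region!r}")
    contig, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in {region!r}")
    return contig, start, end


contig, start, end = parse_region("B331_CP017212_116_lp25:423-4238")
# Coordinates are 1-based and inclusive, so both endpoints count toward the length
region_length = end - start + 1
print(contig, start, end, region_length)
```

<p>
For the example region this gives a length of 3816 bp, which is the length the region-specific reference FASTA should have.
</p>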


In [None]:
### step 1 samtools workflow ###
# subset for reads overlapping region of interest (input BAM must be sorted and indexed)
! samtools view -b B331.sorted.bam B331_CP017212_116_lp25:423-4238 > gene_reads.bam

# sort
! samtools sort -o gene_reads.sorted.bam gene_reads.bam

# index
! samtools index gene_reads.sorted.bam

# create reference seq for region of interest
! samtools faidx B331.fasta B331_CP017212_116_lp25:423-4238 > gene_ref.fasta

# if you want to know how many reads in the new bam
! samtools view -c gene_reads.sorted.bam

In [None]:
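### step 2 bcftools workflow ###
# call variants: mpileup computes genotype likelihoods, call makes the variant calls
# (--ploidy 1 assumes a haploid genome; change it for your organism)
! bcftools mpileup -Ou -f B331.fasta gene_reads.sorted.bam | bcftools call -mv --ploidy 1 -Ov -o gene.vcf

# zip
! bgzip gene.vcf

# index
! bcftools index gene.vcf.gz

# get consensus sequence
! bcftools consensus -f gene_ref.fasta gene.vcf.gz > consensus.fasta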
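
<p>
To see what changed, the consensus can be compared with the region-specific reference. The sketch below lists substitutions between two single-record FASTA sequences and reports them in reference coordinates. It assumes the two sequences have the same length (that is, no indels were applied), and the example sequences and header are made up for illustration.
</p>

```python
def read_single_fasta(text):
    """Parse a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like FASTA")
    return lines[0][1:], "".join(lines[1:]).upper()


def substitutions(ref_seq, cons_seq, region_start=1):
    """Return (position, ref_base, consensus_base) for every mismatching position."""
    if len(ref_seq) != len(cons_seq):
        # Indels shift coordinates, so a plain position-by-position comparison is not valid
        raise ValueError("sequences differ in length; align them before comparing")
    return [
        (region_start + i, r, c)
        for i, (r, c) in enumerate(zip(ref_seq, cons_seq))
        if r != c
    ]


# Made-up sequences standing in for gene_ref.fasta and consensus.fasta
_, ref_seq = read_single_fasta(">contig:423-430\nACGTACGT\n")
_, cons_seq = read_single_fasta(">contig:423-430\nACGAACGT\n")
diffs = substitutions(ref_seq, cons_seq, region_start=423)
print(diffs)
```

<p>
On real data, read <code>gene_ref.fasta</code> and <code>consensus.fasta</code> with <code>open(...).read()</code> and pass the region start (423 in the example above) as <code>region_start</code>.
</p>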