#### Alignment:
The good news: Oxford Nanopore reads are extremely long. Repeat regions are no match for long reads: the variable flanking regions around a repeat determine where in the genome the read belongs. It also means that ambiguity in one part of an alignment can be resolved by other regions of the same read.
The bad news: Oxford Nanopore reads are less accurate than reads from second-generation sequencing technologies.
Therefore, seed-and-extend aligners have had to change tack a bit, using shorter seeds and more lenient mismatch and gap penalties so that a noisy read is not dismissed entirely.
We will use bwa mem with its `ont2d` preset (`-x ont2d`), which sets the parameters to values optimised for Oxford Nanopore 2D reads.
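To get a feel for why seed length matters with noisy reads, the sketch below estimates the chance that a window of k consecutive bases in a read contains no errors, so that it could serve as an exact-match seed. It assumes errors are independent and evenly spread along the read, which is a simplification, and the error rates are illustrative values rather than measurements from this data. The seed lengths 19 and 14 are assumed to be the bwa mem default and the `ont2d` preset value for `-k`; check the usage message of your own bwa installation to confirm.

```python
def error_free_seed_probability(error_rate, seed_length):
    """Probability that seed_length consecutive bases contain no errors.

    Assumes each base is wrong independently with probability error_rate.
    """
    return (1.0 - error_rate) ** seed_length


# Illustrative per-base error rates (assumed, not measured from this data).
for error_rate in (0.01, 0.10, 0.20):
    for seed_length in (19, 14):
        p = error_free_seed_probability(error_rate, seed_length)
        print("error rate %.2f, seed length %2d: P(error-free) = %.3f"
              % (error_rate, seed_length, p))
```

Under this simple model, a shorter seed is always more likely to survive intact, and the gap widens as the error rate climbs.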
This tutorial will take you through the basics of converting a set of fastq files to a sam file and then to a sorted bam file.
It is important for bam files to be sorted for many downstream analyses.
"Sorted" here means ordered by reference sequence and then by alignment start position. Bam files are sam files in binary format, so it isn't easy to show the end result directly.
We can use the fastq files that we extracted from our fast5 files and compare the accuracies over the different chemistries.
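Since a bam file cannot simply be printed, here is a toy picture of what coordinate sorting does. The records and reference names below are invented for illustration; a real bam file holds full alignment records, and the reference order comes from its header.

```python
# Toy alignments: (read name, reference name, 1-based leftmost position).
# These records are invented for illustration.
alignments = [
    ("read_3", "chr2", 150),
    ("read_1", "chr1", 9000),
    ("read_4", "chr1", 42),
    ("read_2", "chr2", 7),
]

# Order of the reference sequences, as it would be listed in the header.
reference_order = {"chr1": 0, "chr2": 1}


def coordinate_sort(records, ref_order):
    """Sort by reference sequence first, then by leftmost position."""
    return sorted(records, key=lambda rec: (ref_order[rec[1]], rec[2]))


sorted_alignments = coordinate_sort(alignments, reference_order)
for name, ref, pos in sorted_alignments:
    print("%s\t%s\t%d" % (name, ref, pos))
```

The E. coli reference used here has a single sequence, so in practice the reads end up ordered by start position alone.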

In [1]:
# Import the modules we need.
import os
import subprocess
from Bio import Entrez, SeqIO

In [2]:
# Set the directories:
SOURCE_DIR = "/mnt/shared/PoreCampAU/"
HOME_DIR = "/home/researcher/"
FASTQ_DIRECTORY = SOURCE_DIR + "data/fastq/"
ALIGNMENT_DIRECTORY = HOME_DIR + "alignment/"

# Create the alignment directory if it doesn't already exist.
if not os.path.isdir(ALIGNMENT_DIRECTORY):
    os.mkdir(ALIGNMENT_DIRECTORY)

In [3]:
# Download the reference 
from Bio import Entrez, SeqIO

# Use your own email here
Entrez.email = "alexiswl@student.unimelb.edu.au"

# Create reference directory and file name.
reference_directory = "/home/researcher/references/"
if not os.path.isdir(reference_directory):
    os.mkdir(reference_directory)
reference_name = "Escherichia_coli_k12_MG1655"
reference_file = reference_directory + reference_name + ".fa"
uid = "U00096.3"  # GenBank accession for the E. coli K-12 MG1655 genome
handle = Entrez.efetch(db="nucleotide", id=uid, rettype="fasta", retmode="text")
fasta_record = SeqIO.read(handle, "fasta")
handle.close()  # Always close the door behind you.
# The with-block closes the output file for us, even if writing fails.
with open(reference_file, "w") as reference_handle:
    SeqIO.write(fasta_record, reference_handle, "fasta")
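It can be worth checking that the download actually produced a sensible fasta file before indexing it. The helper below is a minimal sketch that needs only the standard library; `fasta_lengths` is a name made up for this example and is not part of Biopython.

```python
def fasta_lengths(path):
    """Return {sequence name: length} for every record in a FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The name is the first word after '>'.
                fields = line[1:].split()
                name = fields[0] if fields else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths
```

In this notebook, `fasta_lengths(reference_file)` should report a single record of roughly 4.6 million bases for the E. coli K-12 MG1655 genome.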

In [4]:
# Use samtools to generate a fasta index for the reference file.
command = "samtools faidx %s" % reference_file
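# stderr is merged into stdout, so 'output' holds everything samtools printed.
# universal_newlines=True returns text rather than bytes.
output = subprocess.check_output(command, shell=True, stderr=subprocess.STDOUT,
                                 universal_newlines=True)

if output:
    print(output)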
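The same run-and-print pattern is repeated for every tool in this notebook, and `subprocess.check_output` raises an exception if the tool exits with an error. One way to tidy this up is a small wrapper; `run_command` is a hypothetical helper written for this sketch and is not used by the original cells.

```python
import subprocess


def run_command(command):
    """Run a shell command; return (exit code, combined stdout and stderr)."""
    try:
        output = subprocess.check_output(command, shell=True,
                                         stderr=subprocess.STDOUT,
                                         universal_newlines=True)
        return 0, output
    except subprocess.CalledProcessError as error:
        # check_output raises on a non-zero exit; keep what the tool printed.
        return error.returncode, error.output
```

A cell could then call, for example, `code, output = run_command("samtools faidx %s" % reference_file)` and print `output` only when `code` is non-zero or the text is not empty.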

In [5]:
# Before we run bwa mem, we will also need to run bwa index on the reference file
bwa_index_command = "bwa index %s" % reference_file
output = subprocess.check_output(bwa_index_command, shell=True,
                                 stderr=subprocess.STDOUT, universal_newlines=True)

# bwa reports its progress on stderr, so seeing output here is normal.
if output:
    print("Stderr:\n%s" % output)

Stderr:
[bwa_index] Pack FASTA... 0.07 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.51 seconds elapse.
[bwa_index] Update BWT... 0.05 sec
[bwa_index] Pack forward-only FASTA... 0.05 sec
[bwa_index] Construct SA from BWT and Occ... 0.85 sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa index /home/researcher/references/Escherichia_coli_k12_MG1655.fa
[main] Real time: 3.383 sec; CPU: 2.539 sec



In [30]:
# Now we can align the reads to the reference genome using bwa mem.
# bwa mem writes the alignments in sam format to standard output.
# We can redirect that output to a file using the > sign.
SAMPLE_NAME = "e_coli_R9"

fastq_file = FASTQ_DIRECTORY + "e_coli_R9/metrichor/pass/2d/2016-11-07_E_COLI_R9_pass.2d.fastq"
sam_file = ALIGNMENT_DIRECTORY + SAMPLE_NAME + ".sam"
bam_file = ALIGNMENT_DIRECTORY + SAMPLE_NAME + ".bam"
sorted_bam_file = ALIGNMENT_DIRECTORY + SAMPLE_NAME + ".sorted.bam"
sorted_bam_file_index = ALIGNMENT_DIRECTORY + SAMPLE_NAME + ".sorted.bai"

bwa_command = "bwa mem -x ont2d %s %s > %s" % (reference_file, fastq_file, sam_file)
# stdout goes to the sam file, so only bwa's stderr log is captured here.
output = subprocess.check_output(bwa_command, shell=True,
                                 stderr=subprocess.STDOUT, universal_newlines=True)

if output:
    print("Stderr:\n%s" % output)

Stderr:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1792 sequences (10014122 bp)...
[M::process] read 1778 sequences (10007342 bp)...
[M::mem_process_seqs] Processed 1792 reads in 71.049 CPU sec, 71.289 real sec
[M::process] read 1770 sequences (10000980 bp)...
[M::mem_process_seqs] Processed 1778 reads in 69.119 CPU sec, 69.299 real sec
[M::process] read 1778 sequences (10005590 bp)...
[M::mem_process_seqs] Processed 1770 reads in 74.745 CPU sec, 74.945 real sec
[M::process] read 1766 sequences (10003422 bp)...
[M::mem_process_seqs] Processed 1778 reads in 69.057 CPU sec, 69.208 real sec
[M::process] read 1812 sequences (10001792 bp)...
[M::mem_process_seqs] Processed 1766 reads in 68.853 CPU sec, 69.069 real sec
[M::process] read 1762 sequences (10010250 bp)...
[M::mem_process_seqs] Processed 1812 reads in 74.852 CPU sec, 75.099 real sec
[M::process] read 1776 sequences (10000860 bp)...
[M::mem_process_seqs] Processed 1762 reads in 80.121 CPU sec, 80.436 real sec
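The log above can be summarised to get a rough alignment throughput. The sketch below parses the `mem_process_seqs` lines with a regular expression; the sample text is a few lines copied from the log above, and `summarise_bwa_log` is a name invented for this example.

```python
import re

# A few lines copied from the bwa mem log above.
bwa_log = """\
[M::process] read 1792 sequences (10014122 bp)...
[M::mem_process_seqs] Processed 1792 reads in 71.049 CPU sec, 71.289 real sec
[M::mem_process_seqs] Processed 1778 reads in 69.119 CPU sec, 69.299 real sec
"""

PROCESSED = re.compile(
    r"Processed (\d+) reads in ([\d.]+) CPU sec, ([\d.]+) real sec")


def summarise_bwa_log(log_text):
    """Return (total reads, total real seconds) from mem_process_seqs lines."""
    total_reads = 0
    total_seconds = 0.0
    for match in PROCESSED.finditer(log_text):
        total_reads += int(match.group(1))
        total_seconds += float(match.group(3))
    return total_reads, total_seconds


reads, seconds = summarise_bwa_log(bwa_log)
print("%d reads in %.1f s (%.1f reads/s)" % (reads, seconds, reads / seconds))
```

Passing the full captured log instead of the sample text gives the totals for the whole run.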

In [32]:
# Now let's turn that sam file into a bam file.
# Options come first, then the input file.
sam_to_bam_command = "samtools view -b -o %s %s" % (bam_file, sam_file)
output = subprocess.check_output(sam_to_bam_command, shell=True,
                                 stderr=subprocess.STDOUT, universal_newlines=True)

if output:
    print("Output:\n%s" % output)

In [33]:
# Now sort the bam file by reference coordinate
sort_bam_command = "samtools sort -o %s %s" % (sorted_bam_file, bam_file)
output = subprocess.check_output(sort_bam_command, shell=True,
                                 stderr=subprocess.STDOUT, universal_newlines=True)

if output:
    print("Output:\n%s" % output)

In [34]:
# Now index the bam file
index_bam_command = "samtools index %s %s" % (sorted_bam_file, sorted_bam_file_index)
output = subprocess.check_output(index_bam_command, shell=True,
                                 stderr=subprocess.STDOUT, universal_newlines=True)

if output:
    print("Output:\n%s" % output)
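To compare accuracies across chemistries, as mentioned at the start of this section, one option is to compute a per-read identity from the sam file. The sketch below is one possible definition, not the only one: it takes identity to be 1 - NM / (M + I + D), where NM is the edit-distance tag on each alignment and M, I and D are the CIGAR lengths. It assumes every aligned record carries an `NM:i:` tag; the example record is made up.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_identity(sam_line):
    """Return identity for one SAM record, or None if it cannot be computed.

    Identity is defined here as 1 - NM / (M + I + D), i.e. the fraction of
    alignment columns that are not mismatches, insertions or deletions.
    """
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return None
    flag, cigar = int(fields[1]), fields[5]
    if flag & 4 or cigar == "*":  # unmapped read
        return None
    edit_distance = None
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            edit_distance = int(tag[5:])
    if edit_distance is None:
        return None
    # Count alignment columns: matches/mismatches, insertions and deletions.
    columns = sum(int(length) for length, op in CIGAR_OP.findall(cigar)
                  if op in "MID=X")
    if columns == 0:
        return None
    return 1.0 - float(edit_distance) / columns


# A made-up record: 10 alignment columns and an edit distance of 3.
example = "\t".join(["read_1", "0", "ref", "100", "60", "4M1I2M1D2M",
                     "*", "0", "0", "ACGTAACGT", "*", "NM:i:3"])
print("identity: %.2f" % alignment_identity(example))
```

To apply this to the alignment made above, loop over the lines of `sam_file`, skip header lines that start with `@`, and average the values that are not `None`. You may also want to skip secondary and supplementary alignments (flag bits 256 and 2048) so that each read is counted once.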