 ### IGV, Integrative Genomics Viewer
 Have you used IGV before? Probably. It's an awesome way to see your reads stacked up against a reference!
 
 Using the E. coli reference genome, we will compare the output of the different chemistries, of the different basecalling algorithms applied to the same chemistry, and of a nanopolished file.
 
 IGV is a great way to see what types of errors occur in nanopore reads. Homopolymers cause problems, as you should be able to observe. Most errors in nanopore reads are insertions or deletions, which arise because it is hard to tell whether a change in the current signal comes from a nucleotide or is just noise.
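
The insertions and deletions that IGV draws come from each read's CIGAR string in the bam file. As a rough illustration, here is a minimal sketch that tallies the bases under each CIGAR operation. The function name `count_cigar_ops` and the example CIGAR string are made up for this illustration and are not taken from the workshop data.

```python
import re

# One CIGAR token is a length followed by an operation letter (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_cigar_ops(cigar):
    """Return a dict mapping each CIGAR operation to its total length."""
    counts = {}
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] = counts.get(op, 0) + int(length)
    return counts


# Hypothetical CIGAR string: matches interrupted by two insertions and one deletion.
example = count_cigar_ops("10M2I5M3D20M1I4M")
print(example)
```

Here `I` counts inserted bases and `D` counts deleted bases relative to the reference, so comparing them with `M` gives a feel for how indel-heavy a read is.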

In [2]:
import os, subprocess, re
from igv import IGV, Reference, Track

In [12]:
# Show paths to the bam files
reference_file = "references/e_coli_k12_mg1655/NC_000913.fna"
bam_dir = "/mnt/shared/PoreCampAU/data/alignment/e_coli_R9/"
my_alignment_dir = "/home/researcher/alignment/"
# List the set of files in the bam directory
for dirpath, dirnames, filenames in os.walk(bam_dir):
    if len(filenames) == 0:  # empty folder
        continue
    for filename in filenames:
        if not filename.endswith(".bam"):  # Not a bam file, maybe an index file.
            continue
        print(os.path.join(dirpath, filename))

/mnt/shared/PoreCampAU/data/alignment/e_coli_R9/metrichor2d_nanopolish/nanopolish_e_coli_output.sorted.bam
/mnt/shared/PoreCampAU/data/alignment/e_coli_R9/metrichor/2016-11-14_E_COLI_R9_bwa-mem.sorted.bam
/mnt/shared/PoreCampAU/data/alignment/e_coli_R9/nanonet2d/2016-11-15_E_COLI_R9_bwa-mem.sorted.bam


In [8]:
IGV(locus="")

It's easy to see the differences in quality between each chemistry and basecalling algorithm by eye.
However, quantitative metrics are often easier to explain to someone.
To get them, we'll use the stats subcommand of samtools to generate a report from the bam file.

In [21]:
bam_file = "/mnt/shared/PoreCampAU/data/alignment/e_coli_R9/nanonet2d/2016-11-15_E_COLI_R9_bwa-mem.sorted.bam"
stats_file = my_alignment_dir + "e_coli_R9_nanonet2d_stats.txt"  # rename this for each bam file to stop overwriting.
samtools_stats_command = "samtools stats %s > %s" % (bam_file, stats_file)
# check_call returns the exit code, not stderr; it raises CalledProcessError if samtools fails.
return_code = subprocess.check_call(samtools_stats_command, shell=True, stderr=subprocess.STDOUT)

print("Return code = %s" % return_code)

Return code = 0
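
Since the stats file has to be renamed by hand for each bam file, one option is to derive the name from the bam path. The sketch below is only a suggestion: the helper names `stats_path_for` and `run_samtools_stats` and the naming scheme (parent folder plus bam name) are assumptions, not part of the original notebook. It also passes the command as a list, so no shell is needed.

```python
import os
import subprocess


def stats_path_for(bam_file, out_dir):
    """Build a stats file name that is unique per bam (hypothetical naming scheme)."""
    parent = os.path.basename(os.path.dirname(bam_file))  # e.g. "nanonet2d"
    stem = os.path.basename(bam_file)
    if stem.endswith(".bam"):
        stem = stem[:-len(".bam")]
    return os.path.join(out_dir, "%s_%s_stats.txt" % (parent, stem))


def run_samtools_stats(bam_file, out_dir):
    """Run `samtools stats` and write its stdout to a per-bam file."""
    out_path = stats_path_for(bam_file, out_dir)
    with open(out_path, "w") as out_handle:
        # check_call raises CalledProcessError if samtools exits with a
        # non-zero status, so a normal return means the command succeeded.
        subprocess.check_call(["samtools", "stats", bam_file], stdout=out_handle)
    return out_path
```

Calling `run_samtools_stats(bam_file, my_alignment_dir)` for each bam printed earlier would then give one stats file per alignment without any manual renaming.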


Cool. This file is pretty big for a summary sheet, though.
Fortunately it's divided into sections, each marked with a line prefix such as SN (summary numbers), that we can extract using the grep command.
...we could also use python, because python is beautiful.

In [22]:
with open(stats_file, 'r') as stats_file_handler:
    for line in stats_file_handler:
        if line.startswith("SN\t"):
            print(line.rstrip())  # rstrip gets rid of the \n at the end of the line.

SN	raw total sequences:	91574
SN	filtered sequences:	0
SN	sequences:	91574
SN	is sorted:	1
SN	1st fragments:	91574
SN	last fragments:	0
SN	reads mapped:	83319
SN	reads mapped and paired:	0	# paired-end technology bit set + both mates mapped
SN	reads unmapped:	8255
SN	reads properly paired:	0	# proper-pair bit set
SN	reads paired:	0	# paired-end technology bit set
SN	reads duplicated:	0	# PCR or optical duplicate bit set
SN	reads MQ0:	269	# mapped and MQ=0
SN	reads QC failed:	0
SN	non-primary alignments:	0
SN	total length:	371874194	# ignores clipping
SN	bases mapped:	358737834	# ignores clipping
SN	bases mapped (cigar):	338367121	# more accurate
SN	bases trimmed:	0
SN	bases duplicated:	0
SN	mismatches:	50379129	# from NM fields
SN	error rate:	1.488889e-01	# mismatches / bases mapped (cigar)
SN	average length:	4060
SN	maximum length:	11409
SN	average quality:	255.0
SN	insert size average:	0.0
SN	insert size standard deviation:	0.0
SN	inward oriented pairs:	0
SN	outward oriented pairs:	0
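
Printing the SN lines is fine for reading, but to compare bam files it helps to have the numbers in a dictionary. Here is a minimal sketch, assuming every SN value is numeric as in the report above. The function name `parse_summary_numbers` is made up for this example, and the sample lines are copied from the output above.

```python
def parse_summary_numbers(lines):
    """Turn the SN lines of a samtools stats report into a dict of floats."""
    summary = {}
    for line in lines:
        if not line.startswith("SN\t"):
            continue  # skip every other section of the report
        fields = line.rstrip("\n").split("\t")
        key = fields[1].rstrip(":")  # "reads mapped:" -> "reads mapped"
        summary[key] = float(fields[2])  # any trailing "# comment" field is ignored
    return summary


# A few lines copied from the report above.
sample = [
    "SN\tsequences:\t91574\n",
    "SN\treads mapped:\t83319\n",
    "SN\tmismatches:\t50379129\t# from NM fields\n",
    "SN\tbases mapped (cigar):\t338367121\t# more accurate\n",
]
summary = parse_summary_numbers(sample)
mapped_fraction = summary["reads mapped"] / summary["sequences"]
error_rate = summary["mismatches"] / summary["bases mapped (cigar)"]
print("mapped fraction = %.4f, error rate = %.4f" % (mapped_fraction, error_rate))
```

For this report, roughly 91% of reads mapped, and the recomputed error rate agrees with the `error rate` line that samtools prints. With a real report you would pass the open file handle instead of `sample`.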