`SAM` (Sequence Alignment/Map) files store sequencing reads aligned to a reference. `BAM` files are the compressed binary equivalent of `SAM` files. The format specification can be found [here](http://samtools.github.io/hts-specs/SAMv1.pdf).

`SAM` and `BAM` files should be sorted by position and indexed, and the index files should follow a specific naming convention. Specifically, a `BAM` index file is named by appending `.bai` to the BAM filename (for example, `aln.bam.bai`), and a `SAM` index file is named by appending `.sai`.
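
As a minimal sketch of the naming convention above, the helper below derives the index filename expected for an alignment file and checks whether it exists on disk. The names `expected_index_path` and `has_index` are illustrative, and lowercase index extensions are assumed.

```python
import os

# Index suffix per alignment format, following the convention described above.
INDEX_SUFFIX = {".bam": ".bai", ".sam": ".sai"}


def expected_index_path(alignment_path):
    """Return the index filename expected for a SAM or BAM file."""
    ext = os.path.splitext(alignment_path)[1].lower()
    if ext not in INDEX_SUFFIX:
        raise ValueError("not a SAM or BAM file: " + alignment_path)
    # The index name is the full alignment filename plus the index suffix.
    return alignment_path + INDEX_SUFFIX[ext]


def has_index(alignment_path):
    """True if the expected index file exists next to the alignment file."""
    return os.path.isfile(expected_index_path(alignment_path))


print(expected_index_path("aln.bam"))  # aln.bam.bai
```

Calling `has_index` before opening a file gives an early, readable failure instead of an error from deep inside a viewer or parser.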

__Tutorials__:

- [Working with Genomic Data in Python](http://fullstackdatascientist.io/15/03/2016/genomic-data-visualization-using-python/)
- [Framework for Evaluating Variant Detection Methods](https://bcb.io/2013/05/06/framework-for-evaluating-variant-detection-methods-comparison-of-aligners-and-callers/)

__Python packages__:

- [bamnostic](https://github.com/betteridiot/bamnostic): `BAM`
- [pybedtools](https://github.com/daler/pybedtools): `BAM`, `SAM`, `BED`, `BCF`, `VCF`, `GFF`, and `GTF` 
- [htseq](https://github.com/simon-anders/htseq): `FASTQ`, `BAM`, `SAM`, `VCF`, `BED`
- [GenomeView](https://github.com/nspies/genomeview): `BAM`

__What can we look at?__:

- __Coverage histogram in amplicon regions__: The distribution of coverage inside the amplicon regions. The function parses the regions one by one, collects the coverage of every sample provided, and renders a histogram with coverage on the X axis.

- __Cumulative coverage saturation plot__: The same data as above, rendered as a cumulative histogram. This is useful for comparing samples; for example, it shows how poorly the NSG sample performs.

- __Mapping qualities in amplicon regions__: A stacked histogram of the mapping qualities inside the amplicon regions. It shows that the NSG sample is failing again, and that mapping qualities are around 42 in this particular example.

- ~~__Targeted regions coverage__: The cumulative distribution function (CDF) of coverage per sample inside the target regions~~

- ~~__Coverage heatmap inside amplicon regions__: Another view of coverage inside the amplicon regions. It places all the samples side by side and renders each row as a heatmap. There are 2 examples here, one with an outlier sample like the NSG one and one without~~

- ~~__Allelic frequencies heatmap__: A plot showing how allele frequencies change across positions and samples. I tried to cluster the positions by frequency values but did not get ‘clusterable’ positions, so I sorted the position values across all samples~~

- ~~__Zygosity Matrix__: Dual clustering on positions and sample names, grouping positions by zygosity (heterozygote, homozygote wildtype, homozygote mutant)~~
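
The first two plots in the list above reduce to simple counting over per-base depths. The sketch below shows one possible way to compute the underlying numbers, assuming the depths are already available as a plain list of integers. The function names and the toy values are made up for illustration, and "cumulative" is taken here to mean the fraction of positions covered at depth x or more, which is an assumption about how the saturation plot is defined.

```python
from collections import Counter


def coverage_histogram(depths):
    """Map each coverage value to the number of positions with that depth."""
    return dict(sorted(Counter(depths).items()))


def cumulative_coverage(depths):
    """Map each observed depth x to the fraction of positions with depth >= x."""
    total = len(depths)
    remaining = total
    fractions = {}
    for depth, count in coverage_histogram(depths).items():
        fractions[depth] = remaining / total
        remaining -= count
    return fractions


# Toy per-base depths for one amplicon (made-up numbers, not real data).
toy_depths = [0, 10, 10, 25, 25, 25, 40, 40]
print(coverage_histogram(toy_depths))   # {0: 1, 10: 2, 25: 3, 40: 2}
print(cumulative_coverage(toy_depths))  # {0: 1.0, 10: 0.875, 25: 0.625, 40: 0.25}
```

Either dictionary can be passed to a plotting library as bar heights, with one series per sample.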

In [1]:
import os

In [2]:
os.getcwd()

'/home/ifrancium/Documents/amplicon-ngs-workflow/notebooks'

In [3]:
os.listdir("./data/original_files")

['aln2.fastq.gz',
 'aln.bam',
 'aln1.fastq',
 'aln.bam.bai',
 'aln1.fastq.gz',
 'aln2.fastq']

In [4]:
os.listdir("./data/samtools_output/amplicons")

['amplicon1.bam',
 'amplicon2.bam.bai',
 'whole-data',
 'amplicon4.bam.bai',
 'amplicon3.bam',
 'amplicon3.bam.bai',
 'amplicon1.bam.bai',
 'amplicon4.bam',
 'amplicon2.bam',
 'amp1_bamflags.txt',
 'subsect']

In [5]:
amplicon_bam_dir = "/home/ifrancium/Documents/amplicon-ngs-workflow/notebooks/data/samtools_output/amplicons/"

In [6]:
os.listdir("./../reference_sequences/Homo_sapiens.GRCh37/chromosomes/")

['Homo_sapiens.GRCh37.dna_sm.chromosome.12.fa.sa',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.7.fa.fai',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.14.fa',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.13.fa.fai',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.17.fa.pac',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.14.fa.bwt',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.19.fa',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.10.fa.ann',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.2.fa.pac',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.MT.fa.gz',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.13.fa.bwt',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.14.fa.sa',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.11.fa',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.11.fa.sa',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.18.fa',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.X.fa.gz',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.10.fa',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.12.fa.ann',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.3.fa.pac',
 'Homo_sapiens.GRCh37.dna_sm.chromosome.2

In [7]:
ref_seq_path = "/home/ifrancium/Documents/amplicon-ngs-workflow/reference_sequences/Homo_sapiens.GRCh37/chromosomes/Homo_sapiens.GRCh37.dna_sm.chromosome.7.fa"
print(ref_seq_path)

/home/ifrancium/Documents/amplicon-ngs-workflow/reference_sequences/Homo_sapiens.GRCh37/chromosomes/Homo_sapiens.GRCh37.dna_sm.chromosome.7.fa


### Visualizing Alignment using `genomeview`

In [8]:
import genomeview
from genomeview.bamtrack import SingleEndBAMTrack, PairedEndBAMTrack
import genomeview.axis
import genomeview.graphtrack
from genomeview import genomesource

In [9]:
amplicon1_bam_dir = amplicon_bam_dir + 'amplicon1.bam'
print(amplicon1_bam_dir)

/home/ifrancium/Documents/amplicon-ngs-workflow/notebooks/data/samtools_output/amplicons/amplicon1.bam


In [10]:
track_info = {"Amplicon 1": amplicon1_bam_dir}
chrom = "7"
start = 55242300

In [11]:
tracks = genomeview.visualize_data(track_info, chrom, start, start+300, ref_seq_path)

In [None]:
tracks

__Building tracks__

In [None]:
doc = genomeview.Document(900)

In [None]:
amplicon_start_pos = {
    "1": 55242300,
    "2": 55241500,
    "3": 55259300, 
    "4": 55248900
    }

# Use the same contig name as `chrom` above; the Ensembl GRCh37 FASTA names it "7"
chromosome = "7"
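
The start positions above can be turned into viewing windows for every amplicon at once. The following is a small sketch, assuming the same 300 bp window width used in the earlier `visualize_data` call and the contig name `"7"` from that cell. `amplicon_windows` and `WINDOW_SIZE` are hypothetical names, not part of any library.

```python
WINDOW_SIZE = 300  # same width as the start + 300 window used earlier


def amplicon_windows(start_positions, chrom, width=WINDOW_SIZE):
    """Return {amplicon_id: (chrom, start, end)} for each amplicon."""
    return {
        amp_id: (chrom, start, start + width)
        for amp_id, start in start_positions.items()
    }


amplicon_start_pos = {"1": 55242300, "2": 55241500, "3": 55259300, "4": 55248900}
windows = amplicon_windows(amplicon_start_pos, "7")

for amp_id, (chrom, start, end) in sorted(windows.items()):
    print("amplicon{}: {}:{}-{}".format(amp_id, chrom, start, end))
```

Each tuple can then be unpacked into the chromosome, start, and end arguments of a view.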

In [None]:
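# GenomeView takes a genome source object rather than a path string
source = genomesource.FastaGenomeSource(ref_seq_path)
view = genomeview.GenomeView(chromosome, amplicon_start_pos["1"], amplicon_start_pos["1"] + 300, "+", source)
doc.add_view(view)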

In [None]:
amplicon1_bam_track = genomeview.PairedEndBAMTrack(amplicon_bam_dir + "amplicon1.bam", name="Amplicon1")
view.add_track(amplicon1_bam_track)

axis_track = genomeview.Axis()
view.add_track(axis_track)

In [None]:
import itertools
import time

import HTSeq
import pandas as pd

# Assumption: `aln` is an HTSeq reader over the BAM file listed in ./data/original_files
aln = HTSeq.BAM_Reader("./data/original_files/aln.bam")

print("There are ", sum(1 for x in aln), " alignment reads in this BAM file.")

In [None]:
sample_size = 5

t_start = time.time()

print("Printing the first {} reads in aln:".format(sample_size))

aln_alignment_reads = []

for cnt, alignment_read in enumerate(itertools.islice(aln, sample_size)):

    read_dict = {
        "Sequence": str(alignment_read.read),
        "Aligned?": alignment_read.aligned,
        "Genomic Interval": alignment_read.iv,
        "Paired-end?": alignment_read.paired_end,  # attribute uses an underscore, not a hyphen
        "Phred Score": alignment_read.aQual,
        "CIGAR Objects": alignment_read.cigar,
        "Secondary Alignment (0x100)": alignment_read.not_primary_alignment,
        "Failed QC? (0x200)": alignment_read.failed_platform_qc,
        "PCR/Optical duplicate? (0x400)": alignment_read.pcr_or_optical_duplicate,
        "Supplementary Alignment (0x800)": alignment_read.supplementary,
        "Alignment Description": alignment_read.get_sam_line(),  # method name is singular
        "Mate Aligned?": alignment_read.mate_aligned,
        "Properly Paired? (0x2)": alignment_read.proper_pair,
        "Which Mate?": alignment_read.pe_which,
        # `iv` is None for unaligned reads, so guard the lookup
        "Alignment Start Position": alignment_read.iv.start if alignment_read.aligned else None,
        "Mate Start Position": alignment_read.mate_start,
        "Inferred Insert Size": alignment_read.inferred_insert_size
        }
    aln_alignment_reads.append(read_dict)

    # print(cnt, "Read Length: ", len(alignment_read.read), "Average Phred Score: ", np.sum(alignment_read.aQual)/len(alignment_read.read), "Aligned?", alignment_read.aligned)

    for key, value in read_dict.items():
        print(key, value)

# construct dataframe
aln_df = pd.DataFrame(aln_alignment_reads)

t_stop = time.time()
t_total = t_stop - t_start
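
Several of the fields collected above correspond to individual bits of the SAM `FLAG` value (0x100, 0x200, 0x400, 0x800, and so on). As an illustration of how those bits combine, the sketch below decodes an integer flag into readable names. The bit values follow the SAM specification; the `SAM_FLAGS` table, the short names, and the `decode_flag` helper are written for this example and are not part of HTSeq.

```python
# Bit values as defined for the FLAG field in the SAM specification.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "failed_qc"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG integer."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


print(decode_flag(99))     # ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
print(decode_flag(0x900))  # ['secondary', 'supplementary']
```

This can be handy when reading raw flag dumps such as the output of `samtools view`, where only the integer is shown.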

In [None]:
head = 5

print("There are ", len(aln_df.index), " reads in this dataframe.")
print("This job took {} seconds to complete.".format(t_total))
print("Printing the first {} rows of the dataframe:".format(head))
aln_df.head(head)
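
The `Phred Score` column holds per-base quality values. As a rough sketch of how such values can be summarized per read, the helpers below compute a mean quality and convert a Phred score Q into an error probability using the standard relation P = 10^(-Q/10). The helper names and the toy values are hypothetical, and the input is assumed to be any sequence of integers.

```python
def mean_phred(qualities):
    """Average Phred score of one read; None for an empty read."""
    qualities = list(qualities)
    if not qualities:
        return None
    return sum(qualities) / len(qualities)


def error_probability(phred):
    """Convert a Phred score Q into an error probability, P = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


toy_qualities = [30, 40, 20, 30]  # made-up values, not taken from the BAM file
print(mean_phred(toy_qualities))  # 30.0
print(error_probability(20))
```

Note that averaging Phred scores directly is a simplification; averaging the error probabilities instead gives a different, usually more pessimistic, summary.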

In [None]:
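# `aln_df[...]` selects columns, so rows need `.iloc`.
# Assumption: `index` was meant to be the number of rows in the dataframe.
index = len(aln_df.index)
aln_alignment_reads.append(aln_df.iloc[index - 1].to_dict())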
