# Read Coverage and Variant Calling

---
## Before Class
1. Install bowtie2, samtools, and bamnostic: `conda install bowtie2 samtools bamnostic`
2. Review bowtie2, samtools, and bamnostic documentation
3. Review Counter from collections
---
## Learning Objectives

1. Use the BWT algorithm to map reads to a genome
2. Implement the 'pileup' method
3. Call variants from DNA sequencing data


---
## Imports

In [None]:
import bamnostic as bs
from collections import Counter
from scipy.stats import binom_test  # removed in SciPy 1.12; newer releases provide binomtest instead

---
## Mapping to a genome

In our previous class, we developed our own tool for mapping reads to a genome using the efficient BWT algorithm. Today, we will be using an existing implementation of this algorithm to align many reads to a small genome. 

For the next few classes, we will be working with a small genome that we will assume represents a sample from a diploid individual from a population. The genome itself is quite small at ~9kb and contains only a few genes to make analysis during class tractable. 


In [None]:
# We can use our original get_fasta function to examine the fasta file for the genome
from data_readers import get_fasta
file = "data/sample_genome.fna"

for name, seq in get_fasta(file):
    print(seq[:300])  # first 300 bases (Python slices are 0-based)

Our first step is to build a BWT index of the reference genome so that we can align reads to it. We could use the code from class, but because it only allows exact matches, we would not be able to identify variants in our data. Instead, we will use Bowtie2 ( http://bowtie-bio.sourceforge.net/bowtie2/index.shtml ), an aligner that uses the same algorithm we implemented but tolerates some mismatches to the genome. Because the index (the Burrows-Wheeler transform of the reference) depends only on the reference genome, you only have to build it once.
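
As a reminder of what the index is built from, here is a minimal sketch of the Burrows-Wheeler transform using sorted rotations. It is a teaching sketch, not how Bowtie2 builds its index: the function name `bwt`, the `$` terminator, and the toy string are assumptions made for illustration, and sorting every rotation is only practical for very short sequences.

```python
def bwt(text, terminator="$"):
    """Return the Burrows-Wheeler transform of text (naive version)."""
    # Append a terminator that sorts before A, C, G, and T
    text = text + terminator
    # Build every rotation of the text and sort them lexicographically
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation matrix
    return "".join(rotation[-1] for rotation in rotations)


toy_bwt = bwt("GATTACA")  # hypothetical toy sequence, not our sample genome
print(toy_bwt)
```

The transform is a permutation of the input characters, which is why it can be inverted and searched without storing the rotations themselves.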

First, create an index of the reference genome using `bowtie2-build`:

In [None]:
# This gives us the usage information for bowtie2-build
! bowtie2-build

In [None]:
# From above, the correct format for building an index is:
# bowtie2-build <our genome FASTA file> <name of the index we create>
! bowtie2-build data/sample_genome.fna sample_genome

Once you have created an index, you can map reads to the genome. We have a set of simulated Illumina DNA reads from this genome available in `data/sample_reads.fa`. Use `bowtie2` to align them, and write the alignments to `sample_reads.sam` using the `-S` option.

In [None]:
# This gives us the usage information for bowtie2
! bowtie2

In [None]:
# From above, the correct format for aligning reads to our index is:
# bowtie2 -f -x <name of the index we created> -U <FASTA of sequence reads> -S <name of SAM output file>
# We use -f because our input reads are in a FASTA file
! bowtie2 -f -x sample_genome -U data/sample_reads.fa -S sample_reads.sam

You will also need to convert the SAM file into a sorted BAM file using `samtools sort`. You can read more about samtools at http://www.htslib.org/

In [None]:
# This gives us the usage information for samtools sort
! samtools sort

In [None]:
# From above, the correct format for building the bam file is:
# samtools sort -o <name of bam file output> <name of sam file input>
! samtools sort -o sample_reads.bam sample_reads.sam

You have now mapped your reads to a reference genome using the Burrows-Wheeler algorithm! The `sample_reads.sam` file holds the plain-text version of the alignments, and `sample_reads.bam` holds a compressed, sorted version of the same alignments.
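
To make the plain-text SAM format more concrete, the sketch below splits one alignment line into its 11 mandatory tab-separated fields. The example line is made up for illustration (it is not taken from `sample_reads.sam`), and `SamRecord` and `parse_sam_line` are hypothetical helper names. Header lines in a real SAM file start with `@` and would need to be skipped first.

```python
from collections import namedtuple

# The 11 mandatory SAM fields, in order
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)


def parse_sam_line(line):
    """Parse the mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


# Made-up example record: an 8-base read aligned at 1-based position 101
example_line = "read_1\t0\tsample_genome\t101\t42\t8M\t*\t0\t0\tGATTACAG\tIIIIIIII"
record = parse_sam_line(example_line)
print(record.rname, record.pos, record.cigar, record.seq)

# Bit 0x4 of the FLAG field marks a read that did not align
is_unmapped = bool(record.flag & 0x4)
print(is_unmapped)
```

Note that positions in SAM text are 1-based, while the positions stored in a BAM file are 0-based.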

---
## Identifying variants in the genome

For the second section of today's class, we will be identifying variants in our diploid genome. We will be using the `bamnostic` package ( https://github.com/betteridiot/bamnostic ) to work with our aligned reads from a bam file.

To identify variants, we will test each position for non-reference alleles and perform a binomial test to determine whether there is indeed a variant or just a sequencing error at that position. This algorithm is sometimes referred to as a 'pileup'.

```
find_variants(bam_file):
    For each position in genome:
        count allele frequencies (pileup)
        test for heterozygosity
        test for homozygous alternative allele
```

Note: There are multiple ways to implement this. We recommend using `Counter()` from `collections`, which has been discussed and demonstrated multiple times on the office hours live streams. The binomial test can be implemented directly from the equation below, or you can use `scipy.stats`.
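
Before working with a real BAM file, it may help to see the pileup idea on toy data. The sketch below builds a list of `Counter` objects from a few hard-coded reads. The reads, their start positions, and `genome_length` are all invented for illustration, and each read is assumed to align without gaps.

```python
from collections import Counter

# Hypothetical toy alignments: (0-based start position, read sequence)
toy_reads = [(0, "GATT"), (1, "ATTA"), (2, "TGAC"), (3, "TACA")]
genome_length = 7

# One Counter per genome position, all starting empty
pileup = [Counter() for _ in range(genome_length)]

for start, seq in toy_reads:
    # Base j of the read lands on genome position start + j
    for position, base in enumerate(seq, start):
        if 0 <= position < genome_length:
            pileup[position][base] += 1

for position, counts in enumerate(pileup):
    print(position, dict(counts))

# The two most frequent alleles at a position, as (allele, count) pairs
print(pileup[3].most_common(2))
```

Position 3 is the only one where the toy reads disagree, so it is the only candidate that would go on to the heterozygosity test.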

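The binomial probability of observing an allele exactly $k$ times is $P(X=k) = {n \choose k}p^{k}(1-p)^{n-k}$ where $k$ is the allele count, $n$ is the total number of reads, and $p$ is 0.50 (the expected allele fraction at a heterozygous site). The p-value of the binomial test sums this probability over every outcome at least as extreme as the observed count.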
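
For anyone implementing the test directly from the equation, here is a minimal sketch using only the standard library (`math.comb` requires Python 3.8 or later). The helper names `binomial_pmf` and `two_sided_p_value` are hypothetical, and the two-sided p-value uses one common definition: the total probability of all outcomes that are no more likely than the observed one. The small tolerance factor is an assumption added to keep floating-point ties from being dropped.

```python
from math import comb


def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def two_sided_p_value(k, n, p=0.5):
    """Sum the probabilities of all outcomes no more likely than k."""
    observed = binomial_pmf(k, n, p)
    tolerance = 1 + 1e-7  # guards against floating-point ties
    total = sum(
        binomial_pmf(i, n, p)
        for i in range(n + 1)
        if binomial_pmf(i, n, p) <= observed * tolerance
    )
    return min(1.0, total)


# Allele seen 4 times out of 12 reads (the other allele seen 8 times)
p_val = two_sided_p_value(4, 12)
print(binomial_pmf(4, 12), p_val)

# With the 0.10 threshold used in this class, the site passes the
# heterozygosity test when the p-value is at least 0.10
passes_heterozygosity = p_val >= 0.10
print(passes_heterozygosity)
```

An 8-to-4 split is compatible with a heterozygous site, whereas an 11-to-1 split gives a p-value well below 0.10 and looks more like a sequencing error.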

In [2]:
import bamnostic as bs
from collections import Counter
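try:
    from scipy.stats import binom_test
except ImportError:
    # binom_test was removed in SciPy 1.12; wrap its replacement, binomtest
    from scipy.stats import binomtest
    def binom_test(x, n, p=0.5):
        return binomtest(x, n, p).pvalue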

def get_pileup(alignments, region_start = None, region_end = None):
    ''' Function to build a read pileup list
        this is implemented as a list of Counter() from region_start to region_end
        with our small genome, it is reasonable to cover the entire genome
        but for larger genomes a smaller window is required.
    
    Args:
        alignments (str): bam file of alignments
        region_start (int): start position to build pileup
        region_end (int): end position to build pileup
    
    Returns:
        genome (list of Counter()): a list from region_start to region_end of
            Counters of allele frequencies
        
    Example:
        >>> get_pileup('sample_reads.bam') #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE 
        [Counter({'G': 5}), Counter({'G': 8}), Counter({'T': 12}), ...] 
    '''
    bam = bs.AlignmentFile(alignments, 'rb')

    if region_start is None:
        region_start = 0
        
    if region_end is None:
        region_end = bam.header['SQ'][0]['LN']
        
    genome = [Counter() for _ in range(region_start, region_end)]
    
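    for read in bam:
        # Assumes each read aligns without indels or clipping, so base j of the read sits at read.pos + j
        for i, base in enumerate(read.seq, read.pos):
            # region_end is exclusive: the list holds region_end - region_start Counters
            if region_start <= i < region_end:
                genome[i-region_start].update(base) #sets us back to 0 being the region_start.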
    
    
    return genome

def binomial_test(major, minor):
    ''' Function to perform binomial test
        We will consider a Pvalue threshold of 0.10: 
        SNPs for which the P value of the binomial test < 0.10 failed the heterozygosity test.

    Args:
        major (int): count of most frequent allele
        minor (int): count of second most frequent allele
    
    Returns:
        is_above_threshold (bool): true if passes heterozygosity test, otherwise false
        
    Example:
        >>> binomial_test(8, 4)
        True
    '''
    p_val = binom_test(major, major+minor, 1/2)
    return p_val >= 0.1

def find_variants(reference, alignments):
    ''' Function to find variants given sequencing alignments and a reference
        Identify variants that are heterozygous using heterozygosity test
        Identify variants that are homozygous alternative allele
        
        Note: Variants are reported as 1-based coordinates
        
    Args:
        reference (str): genome reference fasta file
        alignments (str): sequence alignments
    
    Returns:
        variant_list (list of tuples): list of variants as (position, allele1, allele2)
        
    Example:
        >>> find_variants(reference = 'data/sample_genome.fna', alignments = 'sample_reads.bam') #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE 
        [(240, 'A', 'G'), (354, 'G', 'A'), (803, 'C', 'A'), ...]
    '''
    
    variant_list = []
    
    # Get our genome to track variants
    genome = None
    for name, seq in get_fasta(reference):
        genome = seq

    # Build pileup
    genome_pileup = get_pileup(alignments, 0, len(genome))
    
    # Iterate through genome to identify variants
    for i, counter in enumerate(genome_pileup, 1): # start at 1 since variants are reported as 1-based
        ref_allele = genome[i-1] # Expected allele is our reference allele:
        top_alleles = counter.most_common(2) # Get two top alleles
        
        if len(counter) > 1: #test for heterozygous alleles
            if binomial_test(top_alleles[0][1], top_alleles[1][1]):
                # This is a SNP:
                variant_list.append((i, top_alleles[0][0], top_alleles[1][0]))
        elif len(counter) == 0: #no reads here
            continue
        else: # test for homozygous alternate site
            if counter[ref_allele] == 0:
                variant_list.append((i, top_alleles[0][0], top_alleles[0][0]))

    return variant_list

In [3]:
from data_readers import get_fasta  # find_variants calls get_fasta, so import it in case the earlier cell was not run
print(find_variants(reference = 'data/sample_genome.fna', alignments = 'sample_reads.bam'))

In [None]:
import doctest
doctest.testmod()