# RNA expression levels

---
## Before Class
1. Review code from previous class

---
## Learning Objectives

1. Use BWT algorithm to map RNA reads to a genome
* Estimate RNA expression levels from RNA-seq data using RPKM
* Identify allele-specific expression

---
## Imports

In [46]:
import bamnostic as bs
from collections import Counter
from scipy.stats import binom_test

---
## Quantifying RNA expression levels

Calculating the relative expression of RNA involves mapping RNA sequence reads to a reference and then counting the relative abundance of each transcript. Tools doing this typically build a reference assembly out of known transcripts to speed processing and avoid complications of reads crossing splice junctions. However, it is also the case that reads can be mapped to a genome assembly directly to identify novel transcripts or when there are no concerns with intervening intronic sequence. In our class today, we will be implementing RNA-seq analysis for our sample genome that has no introns, and so we will use the reference assembly that was generated in the previous class.


In [2]:
# We can use our original get_fasta function to examine the fasta file for the genome
from data_readers import get_fasta
file = "../class_09/data/sample_genome.fna"

for name, seq in get_fasta(file):
    print(seq[1:300])

GTCTCTCTGGTTAGACCAGATCTGAGCCTGGGAGCTCTCTGGCTAACTAGGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGACCTGAAAGCGAAAGGGAAACCAGAGGAGCTCTCTCGACGCAGGACTCGGCTTGCTGAAGCGCGCACGGCAAGAGGCGAGGGGCGGCGACTGGTGAGTACGCC


Similarly to the last class, we can map our RNA-seq reads to the reference using `bowtie2` and convert these to a bam file using `samtools`. Today we will also be creating an index of our alignment file so that we can randomly access the file

In [3]:
! bowtie2 -f -x ../class_09/sample_genome -U data/sample_RNA_reads.fa -S sample_RNA_reads.sam
! samtools sort -o sample_RNA_reads.bam sample_RNA_reads.sam
! samtools index sample_RNA_reads.bam

6167 reads; of these:
  6167 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    6148 (99.69%) aligned exactly 1 time
    19 (0.31%) aligned >1 times
100.00% overall alignment rate


You have now mapped your reads to a reference genome using the Burrows-Wheeler algorithm! You can take a look at the sample_RNA_reads.sam file to see the plain text version of the alignments and the sample_RNA_reads.bam now contains a compressed version of the same alignments.

---
## Calculate read depth on intervals

Today we will be implementing an RPKM (Reads Per Kilobase of transcript, per Million mapped reads) calculation, which is essentially just counting reads on a genomic interval. To accomplish this, we will use the gff file of gene annotations for our genome and estimate RPKM as: 

RPKM = $ R_{G} / ( L_{G}/1000 * D/1000000)$

Where, $R_{G}$ are the number of reads $R$ mapping to gene $G$, $L_{G}$ is then length of gene $G$, and $D$ is the total read count.

As a practical note, because we are dealing with a tiny genome and very few reads, the RPKM values will be significantly higher than a typical RNA-seq experiment.


```
get_transcript_levels(bam_file, gene_annotations):
    For gene in genome:
        calculate RPKM
```


In [57]:
import bamnostic as bs
from data_readers import get_gff

def calculate_RPKM(gene_reads, gene_length, sequence_depth):
    ''' Calculate RPKM value 
        RPKM = 𝑅𝐺/(𝐿𝐺/1000∗𝐷/1000000)
        Where, 𝑅𝐺 are the number of reads 𝑅 mapping to gene 𝐺, 𝐿𝐺 is then length of gene 𝐺, and 𝐷 is the total read count.
        
    Args:
        gene_reads (int): number of reads overlapping a gene
        gene_length (int): size of the gene
        sequence_depth (int): total number of reads in our sample
    
    Returns:
        rpkm (int): rpkm value
        
    Example:
        >>> calculate_RPKM(1953, 1301, 6167) #doctest: +ELLIPSIS
        243417.05...
    '''
    
    #need to calculate the RPKM value. 
    rpkm = (gene_reads)/(gene_length/1000*(sequence_depth)/1000000)
    
    #tested this and it's good. 
    return rpkm

    
def get_read_count(rna_alignments, seqid, start_position, end_position):
    ''' Calculates the read depth of a bam in a region
    
    Args:
        rna_alignments (str): bam file of aligned reads (requires a bai file to have been created)
        seqid (str): name of the contig for region (ie. chromosome or 'sample')
        start_position (int): start position of region
        end_position (int): end position of region
        
    Returns:
        read_count (int): total number of reads overlapping region
        
    Example:
        >>> get_read_count('sample_RNA_reads.bam', 'sample', 336, 1637)
        1953
    '''
    
    #use bam fetch and enumerate. 
    
    #we are opening the bam file but looking at a very specific region of the genome (start_pos and start_end) to calculate the 
    bam = bs.AlignmentFile(rna_alignments, "rb")
    # first_read = next(bam)
    #first_read.seq, first_read.pos, first_read.read_name 
    # return first_read.pos
    read_count_lst = []
    #you want random access. 
    #we would want to use bam.fetch 
    for i, read in enumerate(bam.fetch(seqid, start_position, end_position)):
        read_count_lst.append(read)
        read_count = len(read_count_lst)
    
    return read_count
        
            
def get_transcript_levels(rna_alignments, gff_file):
    ''' Gets RPKM values for all genes in a gff_file
    
    Args:
        rna_alignments (str): bam file of aligned reads
        gff_file (str): gff formatted gene annotation file
        
    Returns:
        transcript_expression (list of tuples): list of tuples containing contig name, 
            start position of gene, end position of gene, and RPKM for that gene
            
    Example:
        >>> get_transcript_levels('sample_RNA_reads.bam', 'data/sample_genomic.gff')  #doctest: +ELLIPSIS
        [('sample', 336, 1637, 243417.05...
    '''
    
    #output: needs to be a list of tuples (contig name, start_pos, end_pos, RPKM)
    transcript_expression = [] 
    
    #sequence_depth = total number of reads in our bam sample. 
    
    for entry in get_gff(gff_file):  #need to get the entry 
        if entry.type == "CDS": #if the entry is a coding sequence, want to calculate the RPKM
            #the entire read depth for a certain region in the genome. 
            #need to have values for gene_reads and sequence_depth. 
            
            #gene_reads: number of reads overlapping the gene in that specific region. 
            gene_reads = get_read_count(rna_alignments, entry.seqid, entry.start, entry.end)
            
            #sequence_depth: number of reads total in the file. 
            bam = bs.AlignmentFile(rna_alignments, "rb")
            sequence_depth_ls = []
            for i, read in enumerate(bam):
                sequence_depth_ls.append(read)
                sequence_depth = len(sequence_depth_ls)
            
            #now you can calcualte the rpkm for each entry in the gff file. 
            entry_rpkm = calculate_RPKM(gene_reads, entry.end - entry.start, sequence_depth)  
        
            #WANT: list of tuples: contig name, start position of gene, end position of gene, RPKM for that gene. 
            transcript_expression.append((entry.seqid, entry.start, entry.end, entry_rpkm))
    
    return transcript_expression
        


In [58]:
get_transcript_levels('sample_RNA_reads.bam', 'data/sample_genomic.gff')

[('sample', 336, 1637, 243417.05193158844),
 ('sample', 4587, 5165, 405944.5772032523),
 ('sample', 5516, 5591, 492946.3272255554),
 ('sample', 6919, 7488, 365343.8578202537),
 ('sample', 8343, 8963, 324829.8697018993)]

In [50]:
get_read_count("sample_RNA_reads.bam", "sample", 336, 1637)

int

In [12]:
calculate_RPKM(1953, 1301, 6167)

243417.05193158844

In [48]:
lst = ["A", "B", "C"]
for i, read in enumerate(lst):
    print(read)

A
B
C


---
## Allele-specific expression

Combining the last class with todays class, we can also identify locations where the genotype of the sample is heterozygous, but the RNA preferentially comes from one allele. This is called allele-specific expression and is often cause by a transcriptional regulatory effect on the gene. For this section, we will be testing each of the genes for allele-specific expression by using our previously calculated genotypes and testing if there is any bias in the RNA-seq data at these same positions. We can utilize two functions from the previous class to help this calculation.

```
calculate_allelic_ratio(genotype, bam_file, gene_annotations):
    For gene in genome:
        Build pileup for gene regions
        for each known heterozygous variant overlapping region:
            Calculate allelic ratio (min allele / sum of alleles)
            test for heterozygosity 
```


In [127]:
from genotyper import get_pileup, binomial_test

def calculate_allelic_ratio(genotype, rna_alignments, gff_file):
    ''' Calculates allelic ratio for all genes in a gff_file
        We will perform this test at every heterozygous variant and report the average allelic ratio
        and the result of heterozygosity tests at each variant.
    
    Args:
        genotype (list of tuples): list of variants as (position, allele1, allele2)
        rna_alignments (str): bam file of aligned reads
        gff_file (str): gff formatted gene annotation file
        
    Returns:
        gene_allelic_ratios (list of tuples): list of tuples containing contig name, 
            start position of gene, end position of gene, average allelic ratio for that gene,
            and a list of results from heterozygosity tests at each variant in that gene
            
    Example:
    >>> variant_list = [(240, 'A', 'G'), (354, 'G', 'A')]
    >>> calculate_allelic_ratio(variant_list, 'sample_RNA_reads.bam', 'data/sample_genomic.gff')
    [('sample', 336, 1637, 0.3684210526315789, [True])]
    '''
    
    #must return an empty list containing: contig name, start_pos, end_pos, average ellelic ratio, a list of heterozygosity tests 
    gene_allelic_ratios = []
   
    
    #we have get_pileup and binomial_test functions
    
    for entry in get_gff(gff_file):  #for gene in the genome. 
        if entry.type == "CDS": #needs to be a coding sequence.
            allele_ratio_lst = []
            binomial_tests = []
            ratio_average = None #for each gene, it will automatically have "None", but if there is at least one variant, then we'll have the ratio_average. 
            #first need to create the pileup to build a read pileup list. 
            #this returns a list of Counters for each position of how many bases are there. 
            #builds the pileup for gene regions. 
            allele_pileup = get_pileup(rna_alignments, entry.start, entry.end) 
            # print(entry.start)
            # print(entry.end)
            
            #now we need to test heterozygous variant overlapping region. 
            #need to have an actual position to figure out where we're at in the genome. 
            for i, read in enumerate(genotype):
                position, allele1, allele2 = read 
                # print(position, allele1, allele2) 
                # print(genotype)
            
            #the position must be within the start and the end 
                if position >= entry.start and position <= entry.end:
                    #so if our position is in the right region, now we're going to gather all the alleles there in the pileup.
                    all_alleles = allele_pileup[position - entry.start - 1] #sets us back to the zero index. by convention, the genome is 1-index, so position variable is 1-indexed. 
                    #to work with a python list, you must subtract 1 to make it 0-indexed, which python is based off of. We're subtracting entry.start due to our python list starting at the 
                    #gene of interest, NOT at the beginning of the entire genome. 
                    # print(all_alleles)
                    # Calculate allelic ratio (min allele / sum of alleles)
                    #need to index at that certain position to tell which allele you want. 
                    # print(all_alleles[allele1])
                    # print(all_alleles[allele2])
                
                    
                    allele_ratio = min(all_alleles[allele1], all_alleles[allele2])/(all_alleles[allele1] + all_alleles[allele2])
                    
                    
                    allele_ratio_lst.append(allele_ratio) 
                    #we want to do this for every single read. so need to append to an empty list. 

                    #need binomial test which will return True or False. 
                    allele_binomial_test = binomial_test(all_alleles[allele1], all_alleles[allele2])
                    #probably need to add this to an empty list too to account for the iterations. 
                    binomial_tests.append(allele_binomial_test)
            
            
            #first, does a position must have more than 1 counter entry. 
            if len(allele_ratio_lst) > 0: #prevents me from dividing by 0 in case of no variants. 
                #now we need the average allelic ratio for that gene. sum/len
                
                ratio_average = sum(allele_ratio_lst)/len(allele_ratio_lst)
            
            
            gene_allelic_ratios.append((entry.seqid, entry.start, entry.end, ratio_average, binomial_tests)) 
                
    
    return gene_allelic_ratios 
    

[('sample', 336, 1637, 0.3684210526315789, [True])]

In [128]:
variant_list = [(240, 'A', 'G'), (354, 'G', 'A')]
calculate_allelic_ratio(variant_list, 'sample_RNA_reads.bam', 'data/sample_genomic.gff')

[('sample', 336, 1637, 0.3684210526315789, [True]),
 ('sample', 4587, 5165, None, []),
 ('sample', 5516, 5591, None, []),
 ('sample', 6919, 7488, None, []),
 ('sample', 8343, 8963, None, [])]

In [121]:
variant_list = [(240, 'A', 'G'), (354, 'G', 'A'), (803, 'C', 'A'), (1411, 'A', 'G'), (1799, 'C', 'G'), (1829, 'G', 'C'), (1910, 'T', 'C'), (1958, 'A', 'T'), (2119, 'A', 'G'), (2327, 'G', 'A'), (2404, 'A', 'G'), (2425, 'T', 'A'), (2605, 'C', 'T'), (2678, 'G', 'A'), (2788, 'A', 'T'), (2965, 'A', 'C'), (3017, 'T', 'G'), (3141, 'G', 'T'), (3383, 'G', 'C'), (3536, 'G', 'A'), (3779, 'G', 'T'), (3822, 'C', 'A'), (3908, 'G', 'C'), (4497, 'C', 'T'), (4548, 'A', 'C'), (4737, 'G', 'A'), (4823, 'G', 'A'), (5144, 'C', 'G'), (5275, 'G', 'T'), (5313, 'T', 'C'), (5522, 'A', 'G'), (5586, 'G', 'A'), (5624, 'T', 'A'), (5652, 'A', 'C'), (5910, 'C', 'A'), (6013, 'A', 'T'), (6303, 'A', 'G'), (6632, 'G', 'C'), (6734, 'A', 'C'), (7099, 'C', 'T'), (7376, 'A', 'C'), (7466, 'C', 'T'), (7970, 'C', 'G'), (8116, 'G', 'T'), (8294, 'T', 'C'), (8545, 'T', 'C'), (8788, 'C', 'A'), (8981, 'G', 'C')]
calculate_allelic_ratio(variant_list, 'sample_RNA_reads.bam', 'data/sample_genomic.gff')

[('sample', 336, 1637, 0.4087547299621603, [True, True, True]),
 ('sample', 4587, 5165, 0.09736648282501408, [False, False, False]),
 ('sample', 5516, 5591, 0.4119047619047619, [True, True]),
 ('sample', 6919, 7488, 0.4762833008447043, [True, True, True]),
 ('sample', 8343, 8963, 0.20277935337492908, [False, False])]

In [94]:
min(12,7)

7

In [96]:
(12 + 7)

19

In [122]:
import doctest
doctest.testmod()

**********************************************************************
File "__main__", line 20, in __main__.calculate_allelic_ratio
Failed example:
    calculate_allelic_ratio(variant_list, 'sample_RNA_reads.bam', 'data/sample_genomic.gff')
Expected:
    [('sample', 336, 1637, 0.3684210526315789, [True])]
Got:
    [('sample', 336, 1637, 0.3684210526315789, [True]), ('sample', 4587, 5165, None, []), ('sample', 5516, 5591, None, []), ('sample', 6919, 7488, None, []), ('sample', 8343, 8963, None, [])]
**********************************************************************
1 items had failures:
   1 of   2 in __main__.calculate_allelic_ratio
***Test Failed*** 1 failures.


TestResults(failed=1, attempted=5)

In [59]:
lst = [("A", 1, "1.2"), ("B", 2, "2.2"), ("C", 3, "3.3")]

In [63]:
for i, read in enumerate(lst):
    seqid, num, phrase = read 
    print(phrase)

1.2
2.2
3.3


In [35]:
def get_pileup(alignments, region_start = None, region_end = None):
    ''' Function build a read pileup list
        this is implemented as a list of Counter() from region_start to region_end
        with our small genome, it is reasonable to cover the entire genome
        but for larger genomes a smaller window is required.
    
    Args:
        alignments (str): bam file of alignments
        region_start (int): start position to build pileup
        region_end (int): end position to build pileup
    
    Returns:
        genome (list of Counter()): a list from region_start to region_end of
            Counters of allele frequencies
        
    Example:
        >>> get_pileup('sample_reads.bam') #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE 
        [Counter({'G': 5}), Counter({'G': 8}), Counter({'T': 12}), ...] 
    '''
    bam = bs.AlignmentFile(alignments, 'rb')

    if(region_start == None):
        region_start = 0
        
    if(region_end == None):
        region_end = bam.header['SQ'][0]['LN']
        
    genome = [Counter() for _ in range(region_start, region_end)]
    
    for read in bam:
        for i, base in enumerate(read.seq, read.pos):
            if i >= region_start and i <= region_end:
                genome[i-region_start].update(base)
  
    return genome

In [66]:
test1 = get_pileup('sample_reads.bam')
test1[0]

Counter({'G': 5})