# RNA expression levels

---
## Before Class
1. Review code from previous class

---
## Learning Objectives

1. Use BWT algorithm to map RNA reads to a genome
2. Estimate RNA expression levels from RNA-seq data using RPKM
3. Identify allele-specific expression

---
## Imports

In [2]:
import bamnostic as bs
from collections import Counter
from scipy.stats import binomtest  # binom_test was removed in SciPy 1.12

---
## Quantifying RNA expression levels

Calculating the relative expression of RNA involves mapping RNA sequence reads to a reference and then counting the relative abundance of each transcript. Tools that do this typically build a reference out of known transcripts, which speeds up processing and avoids the complication of reads crossing splice junctions. Reads can also be mapped directly to a genome assembly, either to identify novel transcripts or when intervening intronic sequence is not a concern. Today we will implement RNA-seq analysis for our sample genome, which has no introns, so we will use the reference assembly that was generated in the previous class.


In [None]:
# We can use our original get_fasta function to examine the fasta file for the genome
from data_readers import get_fasta
file = "../class_09/data/sample_genome.fna"

for name, seq in get_fasta(file):
    print(seq[:300])  # first 300 bases; seq[1:300] would skip the first base

As in the last class, we can map our RNA-seq reads to the reference using `bowtie2` and convert the output to a BAM file using `samtools`. Today we will also create an index of our alignment file so that we can access any region of it directly without reading the whole file.

In [None]:
! bowtie2 -f -x ../class_09/sample_genome -U data/sample_RNA_reads.fa -S sample_RNA_reads.sam
! samtools sort -o sample_RNA_reads.bam sample_RNA_reads.sam
! samtools index sample_RNA_reads.bam

You have now mapped your reads to a reference genome using Bowtie 2, which is built on a Burrows-Wheeler transform (BWT) index! You can open `sample_RNA_reads.sam` to see the plain-text version of the alignments; `sample_RNA_reads.bam` contains a compressed, sorted version of the same alignments.
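To get a feel for the plain-text SAM format, the sketch below tallies mapped and unmapped reads from SAM-formatted lines. It relies only on the standard SAM layout: tab-separated fields, header lines starting with `@`, and the `0x4` flag bit marking an unmapped read. The function name `count_sam_reads` and the two example records are made up for illustration and are not taken from `sample_RNA_reads.sam`.

```python
def count_sam_reads(lines):
    """Return (mapped, unmapped) read counts from an iterable of SAM lines."""
    mapped = unmapped = 0
    for line in lines:
        # Skip blank lines and header lines, which start with '@'
        if not line.strip() or line.startswith('@'):
            continue
        fields = line.rstrip('\n').split('\t')
        flag = int(fields[1])  # second column is the bitwise FLAG
        if flag & 0x4:         # 0x4 bit set: read is unmapped
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped

# Hypothetical records for illustration only
example_sam = [
    "@SQ\tSN:sample\tLN:9000",
    "read_1\t0\tsample\t340\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read_2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGG\tIIIIIIII",
]
print(count_sam_reads(example_sam))  # (1, 1)
```

On the real file, the same function could be fed an open file handle, for example `count_sam_reads(open('sample_RNA_reads.sam'))`.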

---
## Calculate read depth on intervals

Today we will implement an RPKM (Reads Per Kilobase of transcript, per Million mapped reads) calculation, which counts the reads on a genomic interval and normalizes that count by the interval length and the sequencing depth. To accomplish this, we will use the GFF file of gene annotations for our genome and estimate RPKM as:

$\mathrm{RPKM} = \dfrac{R_{G}}{(L_{G}/1000) \times (D/1000000)}$

where $R_{G}$ is the number of reads $R$ mapping to gene $G$, $L_{G}$ is the length of gene $G$, and $D$ is the total read count.

As a practical note, because we are dealing with a tiny genome and very few reads, $D$ is small and the RPKM values will be much higher than in a typical RNA-seq experiment.
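The arithmetic can be checked with a short sketch. The first call uses the numbers from the doctest in this notebook (1953 reads, a gene length of 1301, and 6167 total reads). The second call swaps in a hypothetical depth of 20 million reads, chosen only to show how strongly the total read count affects the result.

```python
def rpkm(gene_reads, gene_length, total_reads):
    """Reads per kilobase of transcript, per million mapped reads."""
    return gene_reads / ((gene_length / 1000) * (total_reads / 1_000_000))

# Values from the doctest example in this notebook
small_sample = rpkm(1953, 1301, 6167)

# Same gene and read count, but a hypothetical depth of 20 million reads
deep_sample = rpkm(1953, 1301, 20_000_000)

print(f"{small_sample:.2f}")  # 243417.05
print(f"{deep_sample:.2f}")
```

With the same read count on the gene, the deeper hypothetical sample gives an RPKM of roughly 75, more than three orders of magnitude below the value for our tiny dataset.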


```
get_transcript_levels(bam_file, gene_annotations):
    For gene in genome:
        calculate RPKM
```
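Counting reads on an interval comes down to an overlap test between each read and the gene. The sketch below shows that test on a toy list of reads given as `(start, end)` tuples with inclusive coordinates. The read positions and the name `count_overlapping_reads` are hypothetical, and a real BAM file would normally be queried through its index rather than scanned like this.

```python
def count_overlapping_reads(reads, start, end):
    """Count reads whose [read_start, read_end] span overlaps [start, end]."""
    return sum(
        1
        for read_start, read_end in reads
        # Two closed intervals overlap when each one starts before the other ends
        if read_start <= end and read_end >= start
    )

# Hypothetical 50 bp reads around a gene spanning positions 336 to 1637
toy_reads = [(300, 349), (330, 379), (1600, 1649), (1638, 1687), (2000, 2049)]
print(count_overlapping_reads(toy_reads, 336, 1637))  # 3
```

Reads that only partly cover the gene still count here, which mirrors counting every alignment that touches the region.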


In [4]:
import bamnostic as bs
from data_readers import get_gff

def calculate_RPKM(gene_reads, gene_length, sequence_depth):
    ''' Calculate RPKM value
        RPKM = R_G / (L_G/1000 * D/1000000)
        where R_G is the number of reads mapping to gene G, L_G is the length of gene G,
        and D is the total read count.
        
    Args:
        gene_reads (int): number of reads overlapping a gene
        gene_length (int): size of the gene
        sequence_depth (int): total number of reads in our sample
    
    Returns:
        rpkm (float): rpkm value
        
    Example:
        >>> calculate_RPKM(1953, 1301, 6167) #doctest: +ELLIPSIS
        243417.05...
    '''
    return gene_reads / (gene_length/1000 * sequence_depth/1000000)

def get_read_count(rna_alignments, seqid, start_position, end_position):
    ''' Calculates the read depth of a bam in a region
    
    Args:
        rna_alignments (str): bam file of aligned reads (requires a bai file to have been created)
        seqid (str): name of the contig for region (ie. chromosome or 'sample')
        start_position (int): start position of region
        end_position (int): end position of region
        
    Returns:
        read_count (int): total number of reads overlapping region
        
    Example:
        >>> get_read_count('sample_RNA_reads.bam', 'sample', 336, 1637)
        1953
    '''
    # Fetching by region relies on the index created with `samtools index`
    bam = bs.AlignmentFile(rna_alignments, 'rb')
    return sum(1 for _ in bam.fetch(seqid, start_position, end_position))

def get_transcript_levels(rna_alignments, gff_file):
    ''' Gets RPKM values for all genes in a gff_file
    
    Args:
        rna_alignments (str): bam file of aligned reads
        gff_file (str): gff formatted gene annotation file
        
    Returns:
        transcript_expression (list of tuples): list of tuples containing contig name, 
            start position of gene, end position of gene, and RPKM for that gene
            
    Example:
        >>> get_transcript_levels('sample_RNA_reads.bam', 'data/sample_genomic.gff')  #doctest: +ELLIPSIS
        [('sample', 336, 1637, 243417.05...
    '''
    
    transcript_expression = []
    
    bam = bs.AlignmentFile(rna_alignments, 'rb')
    read_depth = sum(1 for read in bam)

    for gff_entry in get_gff(gff_file):
        if gff_entry.type == 'CDS':
            gene_reads = get_read_count(rna_alignments, gff_entry.seqid, gff_entry.start, gff_entry.end)
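            # Gene length is taken as end - start, matching the doctest above (1637 - 336 = 1301);
            # if the coordinates are inclusive, the true length is one base longer
            rpkm = calculate_RPKM(gene_reads, gff_entry.end-gff_entry.start, read_depth)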
            transcript_expression.append((gff_entry.seqid, gff_entry.start, gff_entry.end, rpkm))
        
    return transcript_expression



In [None]:
get_transcript_levels('sample_RNA_reads.bam', 'data/sample_genomic.gff')

---
## Allele-specific expression

Combining the last class with today's class, we can also identify locations where the genotype of the sample is heterozygous but the RNA preferentially comes from one allele. This is called allele-specific expression and is often caused by a transcriptional regulatory effect on the gene. In this section, we will test each gene for allele-specific expression by taking our previously calculated genotypes and testing whether the RNA-seq data show any bias at those same positions. Two functions from the previous class, `get_pileup` and `binomial_test`, will help with this calculation.

```
calculate_allelic_ratio(genotype, bam_file, gene_annotations):
    For gene in genome:
        Build pileup for gene regions
        for each known heterozygous variant overlapping region:
            Calculate allelic ratio (min allele / sum of alleles)
            test for heterozygosity 
```
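The allelic ratio and the bias test in the pseudocode above can be sketched without any alignment files. The example below is a minimal, pure-Python version. It assumes a two-sided exact binomial test against an expected 50/50 split and returns a p-value, which may differ from what `binomial_test` in `genotyper` returns, since that function is not shown here. The function names and the allele counts (7 and 12, then 2 and 30) are hypothetical.

```python
from math import comb

def allelic_ratio(count1, count2):
    """Minor allele count divided by total count; None if there is no coverage."""
    total = count1 + count2
    if total == 0:
        return None
    return min(count1, count2) / total

def two_sided_binomial_p(count1, count2, p=0.5):
    """Exact two-sided binomial p-value for observing count1 out of count1 + count2."""
    n = count1 + count2
    probs = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    observed = probs[count1]
    # Sum every outcome that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties
    return min(1.0, sum(pr for pr in probs if pr <= observed * (1 + 1e-7)))

# Hypothetical allele counts at two heterozygous sites
balanced_p = two_sided_binomial_p(7, 12)
skewed_p = two_sided_binomial_p(2, 30)

print(allelic_ratio(7, 12))        # 0.3684210526315789
print(f"{balanced_p:.3f}")         # 0.359
print(skewed_p < 0.001)            # True
```

In this sketch, 7 versus 12 reads is compatible with balanced expression, while 2 versus 30 reads would be strong evidence that one allele dominates.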


In [1]:
from genotyper import get_pileup, binomial_test
from data_readers import get_gff

def calculate_allelic_ratio(genotype, rna_alignments, gff_file):
    ''' Calculates allelic ratio for all genes in a gff_file
        We will perform this test at every heterozygous variant and report the average allelic ratio
        and the result of heterozygosity tests at each variant.
    
    Args:
        genotype (list of tuples): list of variants as (position, allele1, allele2)
        rna_alignments (str): bam file of aligned reads
        gff_file (str): gff formatted gene annotation file
        
    Returns:
        gene_allelic_ratios (list of tuples): list of tuples containing contig name, 
            start position of gene, end position of gene, average allelic ratio for that gene,
            and a list of results from heterozygosity tests at each variant in that gene
            
    Example:
    >>> variant_list = [(240, 'A', 'G'), (354, 'G', 'A')]
    >>> calculate_allelic_ratio(variant_list, 'sample_RNA_reads.bam', 'data/sample_genomic.gff')
    [('sample', 336, 1637, 0.3684210526315789, [True])]
    '''
    gene_allelic_ratios = []
    
    for gff_entry in get_gff(gff_file):
        if gff_entry.type == 'CDS':
          
            allelic_ratios = []
            binomial_tests = []
            
            # Build pileup
            rna_pileup = get_pileup(rna_alignments, gff_entry.start, gff_entry.end)
    
            # Test for bias at heterozygous sites using binomial test
            for location, allele1, allele2 in genotype:
                
                # If the variant is in the gene, calculate the allele frequencies
                if gff_entry.start <= location <= gff_entry.end:
                    alleles = rna_pileup[location-gff_entry.start-1] # offset by 1!
                    total = alleles[allele1] + alleles[allele2]
                    if total == 0:
                        continue  # no RNA reads carry either allele; avoid dividing by zero
                    ratio = min(alleles[allele1], alleles[allele2])/total
                    allelic_ratios.append(ratio)
                    binomial_tests.append(binomial_test(alleles[allele1], alleles[allele2]))
            
            if len(allelic_ratios) > 0:
                average_ratio = sum(allelic_ratios)/len(allelic_ratios)
                gene_allelic_ratios.append((gff_entry.seqid, gff_entry.start, gff_entry.end, average_ratio, binomial_tests))
    
    return gene_allelic_ratios


In [None]:
variant_list = [(240, 'A', 'G'), (354, 'G', 'A'), (803, 'C', 'A'), (1411, 'A', 'G'), (1799, 'C', 'G'), (1829, 'G', 'C'), (1910, 'T', 'C'), (1958, 'A', 'T'), (2119, 'A', 'G'), (2327, 'G', 'A'), (2404, 'A', 'G'), (2425, 'T', 'A'), (2605, 'C', 'T'), (2678, 'G', 'A'), (2788, 'A', 'T'), (2965, 'A', 'C'), (3017, 'T', 'G'), (3141, 'G', 'T'), (3383, 'G', 'C'), (3536, 'G', 'A'), (3779, 'G', 'T'), (3822, 'C', 'A'), (3908, 'G', 'C'), (4497, 'C', 'T'), (4548, 'A', 'C'), (4737, 'G', 'A'), (4823, 'G', 'A'), (5144, 'C', 'G'), (5275, 'G', 'T'), (5313, 'T', 'C'), (5522, 'A', 'G'), (5586, 'G', 'A'), (5624, 'T', 'A'), (5652, 'A', 'C'), (5910, 'C', 'A'), (6013, 'A', 'T'), (6303, 'A', 'G'), (6632, 'G', 'C'), (6734, 'A', 'C'), (7099, 'C', 'T'), (7376, 'A', 'C'), (7466, 'C', 'T'), (7970, 'C', 'G'), (8116, 'G', 'T'), (8294, 'T', 'C'), (8545, 'T', 'C'), (8788, 'C', 'A'), (8981, 'G', 'C')]
calculate_allelic_ratio(variant_list, 'sample_RNA_reads.bam', 'data/sample_genomic.gff')

In [None]:
import doctest
doctest.testmod()