# eQTL analysis

---
## Before Class
1. Review previous two classes
* Review slides on eQTL analysis

---
## Learning Objectives
1. Understand bulk processing of data
* Perform eQTL analysis

---
## Bulk data processing
Another way to predict the effect of variants on expression is to study genotypes rather than allele-specific expression. For today's class, we have expanded our study to include DNA and RNA samples from 100 diploid individuals, all from the same small reference genome. Our goal is to test whether any variant in the genome is associated with the expression of a gene. Such variants are called eQTLs (expression quantitative trait loci), and they help explain variation in gene expression levels. It is important to note that while these variants are associated with expression changes, they are not necessarily causative of the expression change.

The first step of today's class is to process all of the RNA and DNA data by aligning it to the genome and converting the alignments to BAM files. The RNA BAM files will also need to be indexed, as in the previous class.

Hint: A Python loop can be wrapped around the shell commands we've previously used, and Python variables can be used in the shell commands by surrounding them with curly braces `{}`.
For example:
```
for i in range(10):
    !echo {i}
```
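The `!` prefix and `{i}` substitution only work inside a notebook. As a minimal sketch of the same idea in plain Python, the commands can be built as argument lists and inspected before anything is run. The helper name `build_dna_commands` and its default arguments are assumptions for illustration; the paths mirror the ones used in this class.

```python
import shlex

def build_dna_commands(sample_id, index="../class_09/sample_genome", data_dir="data/DNA"):
    """Return the align and sort commands for one sample as argument lists."""
    prefix = f"{data_dir}/sample_reads_P{sample_id}"
    align = ["bowtie2", "-f", "-x", index, "-U", f"{prefix}.fa.gz", "-S", f"{prefix}.sam"]
    sort = ["samtools", "sort", "-o", f"{prefix}.bam", f"{prefix}.sam"]
    return [align, sort]

# Dry run: print the commands for the first two samples instead of executing them.
for i in range(1, 3):
    for cmd in build_dna_commands(i):
        print(shlex.join(cmd))
```

To actually execute a command outside the notebook, each list could be passed to `subprocess.run(cmd, check=True)`, which raises an error if the tool exits with a failure code.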

In [1]:
# Align all of the DNA reads and convert the alignments to sorted BAM files
for i in range(1, 101):
    !bowtie2 -f -x ../class_09/sample_genome -U data/DNA/sample_reads_P{i}.fa.gz -S data/DNA/sample_reads_P{i}.sam
    !samtools sort -o data/DNA/sample_reads_P{i}.bam data/DNA/sample_reads_P{i}.sam

22952 reads; of these:
  22952 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    22531 (98.17%) aligned exactly 1 time
    421 (1.83%) aligned >1 times
100.00% overall alignment rate
22952 reads; of these:
  22952 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    22547 (98.24%) aligned exactly 1 time
    405 (1.76%) aligned >1 times
100.00% overall alignment rate
22952 reads; of these:
  22952 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    22524 (98.14%) aligned exactly 1 time
    428 (1.86%) aligned >1 times
100.00% overall alignment rate
22952 reads; of these:
  22952 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    22532 (98.17%) aligned exactly 1 time
    420 (1.83%) aligned >1 times
100.00% overall alignment rate
22952 reads; of these:
  22952 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    22527 (98.15%) aligned exactly 1 time
    425 (1.85%) aligned >1 times
100.00% overall align

In [2]:
# Process RNA the same way:
for i in range(1, 101):
    !bowtie2 -f -x ../class_09/sample_genome -U data/RNA/sample_RNA_reads_P{i}.fa.gz -S data/RNA/sample_RNA_reads_P{i}.sam
    !samtools sort -o data/RNA/sample_RNA_reads_P{i}.bam data/RNA/sample_RNA_reads_P{i}.sam
    !samtools index data/RNA/sample_RNA_reads_P{i}.bam

5755 reads; of these:
  5755 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    5747 (99.86%) aligned exactly 1 time
    8 (0.14%) aligned >1 times
100.00% overall alignment rate
5741 reads; of these:
  5741 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    5731 (99.83%) aligned exactly 1 time
    10 (0.17%) aligned >1 times
100.00% overall alignment rate
4431 reads; of these:
  4431 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    4426 (99.89%) aligned exactly 1 time
    5 (0.11%) aligned >1 times
100.00% overall alignment rate
5372 reads; of these:
  5372 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    5370 (99.96%) aligned exactly 1 time
    2 (0.04%) aligned >1 times
100.00% overall alignment rate
5008 reads; of these:
  5008 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    4996 (99.76%) aligned exactly 1 time
    12 (0.24%) aligned >1 times
100.00% overall alignment rate
5562 reads; o

## Calculate the genotypes

Next, we can calculate the genotypes for all of the individuals. We did this two classes ago and now just need to keep track of it for each person. Our end goal of this step is mostly just reformatting the data (this is very common in bioinformatics). We would like to have an array of variants by samples that contains the genotype of a sample for each variant. To build a master variant list, we will be using the criteria of common variants (>1% population frequency). This reduces the number of tests that we will have to perform.

The first step of this will be to collate all of the variants into a list:

In [11]:
from genotyper import *

# Calculate the genotypes for all of the individuals, keeping track of them per individual.

def get_all_variants():
    ''' Function to iterate through all DNA files to get genotypes
    
    This function runs our previous find_variants() function on all of the individuals and makes a list of the results
    
    Example:
        >>> get_all_variants() #doctest: +ELLIPSIS
        [(803, 'C', 'A'), (5144, 'G', 'C'), (5275, 'T', 'T'), ...
    '''
    individual_variants = []

    # Samples are numbered 1-100, so sample i ends up at list index i - 1
    for i in range(1, 101):
        # Run find_variants and store in a list indexed by sample
        individual_variants.append(find_variants(reference = '../class_09/data/sample_genome.fna', alignments = 'data/DNA/sample_reads_P' + str(i) + '.bam'))

    return individual_variants

In [2]:
# This takes a few minutes to run
# Store the per-individual variant lists for the later steps
individual_genotypes = get_all_variants()




In [14]:
# print(individual_genotypes)

The next step will be to first identify our final variant list (`build_list_of_common_variants`) and then use that list along with the list of genotypes to build our final bulk matrix described above (`build_bulk_genotype`).
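To make the target data structure concrete, here is a small sketch on made-up data. It assumes, as in the exercise, that each variant is a `(position, allele1, allele2)` tuple and that identical alleles mean homozygous alternative. The toy threshold of "seen in more than one individual" stands in for the 1% cutoff, and the per-individual dictionary lookup is one possible design choice, not the required solution.

```python
from collections import Counter

# Toy data (hypothetical): three individuals, each with a list of variant calls.
toy_genotypes = [
    [(10, "A", "G"), (50, "T", "T")],   # het at 10, hom-alt at 50
    [(10, "G", "G")],                   # hom-alt at 10
    [(50, "T", "C"), (99, "C", "T")],   # het at 50, het at 99
]

# Count how many individuals carry a variant at each position.
counts = Counter(pos for sample in toy_genotypes for pos, _, _ in sample)

# Keep positions seen in more than one individual.
common = sorted(pos for pos, n in counts.items() if n > 1)

# One dictionary per individual makes the position lookup direct.
lookups = [{pos: (a1, a2) for pos, a1, a2 in sample} for sample in toy_genotypes]

# Build the variant x individual matrix: 0 = ref/ref, 1 = het, 2 = hom-alt.
matrix = []
for pos in common:
    row = []
    for lookup in lookups:
        if pos not in lookup:
            row.append(0)
        else:
            a1, a2 = lookup[pos]
            row.append(2 if a1 == a2 else 1)
    matrix.append(row)

print(common)   # [10, 50]
print(matrix)   # [[1, 2, 0], [2, 0, 1]]
```

Position 99 is dropped because only one individual carries it, which is the toy equivalent of filtering out rare variants.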

In [24]:
from collections import Counter
import numpy as np

def build_list_of_common_variants(individual_genotypes):
    ''' This function will combine all of the genotypes to create a master list of SNP locations
        for common variants. Common variants are those that appear at a frequency of more than 1%
        in the population.
        
        Note: we only need to track the position of the SNPs
    '''
    
    all_variant_counter = Counter()
    all_variant_list = []

    # Count how many samples carry a variant at each position
    for sample in individual_genotypes:
        for variant in sample:
            position, allele1, allele2 = variant
            all_variant_counter[position] += 1

    # Keep positions whose frequency is above 1%; with 100 samples that means a count greater than 1
    for key in all_variant_counter:
        if all_variant_counter[key] > 1:
            all_variant_list.append(key)

    return all_variant_list
    
def build_bulk_genotype(all_variants, individual_genotypes):
    ''' This function will assign one of 3 genotypes at every SNP location
    ref/ref = 0
    ref/alt = 1
    alt/alt = 2
    
    A position with no variant call in a sample is treated as ref/ref.
    
    This will populate a variant x individual matrix with the 3 genotypes, maintaining individual order from 
    individual_genotypes and variant order from all_variants
    
    Pseudocode:
    For each variant
        For each individual
            if individual has variant assign value 1 if het, 2 if homozygous alt
            otherwise value is 0
    
    Args:
        all_variants (list): list of common variants
        individual_genotypes (list of list of tuples): list of all variants for each individual
        
    Returns:
        sample_variant_matrix (numpy int matrix): matrix of variants x individuals
 
    '''
    
    # all_variants holds only the positions of common SNPs (seen in at least 2 of the 100 samples).
    # individual_genotypes holds, for each sample, tuples of (position, allele1, allele2).
    # For every common position, look it up in each sample's calls and compare the two alleles.

    # Initialize the matrix to 0 (ref/ref) for every variant and individual
    sample_variant_matrix = np.zeros((len(all_variants), len(individual_genotypes)), dtype=int)

    # For each variant
    for index, variant_pos in enumerate(all_variants):
        # For each individual
        for index2, variants in enumerate(individual_genotypes):
            # A sample with no call at this position is homozygous reference
            sample_genotype = 0

            # Search this individual's calls for the current position
            for pos, allele1, allele2 in variants:
                if pos == variant_pos:
                    if allele1 == allele2:
                        # Homozygous alternative
                        sample_genotype = 2
                    else:
                        # Heterozygous
                        sample_genotype = 1
                    break

            # Rows follow all_variants; columns follow individual_genotypes
            sample_variant_matrix[index][index2] = sample_genotype

    return sample_variant_matrix
                
# Notes:
# - A sample with no record at a position carries the reference allele there (genotype 0).
# - Next, the genotypes are associated with gene expression in the samples using a linear model:
#   genotype (0, 1, 2) is the predictor and RPKM is the continuous response.

In [27]:
# Now we get the final bulk_genotype matrix
full_variant_list = build_list_of_common_variants(individual_genotypes)
bulk_genotypes = build_bulk_genotype(full_variant_list, individual_genotypes)
print(bulk_genotypes)

## Calculate the gene expression

In addition to the genotype matrix above, we also need to generate expression data for all of the individuals at each gene. To do this, we can use the `get_transcript_levels` method from the previous class to get the expression level of each gene for each person. As above, we will store this as a gene x sample matrix with the RPKM in each position.
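As a reminder of what an RPKM value represents, the sketch below applies the standard definition (reads per kilobase of transcript per million mapped reads). It assumes `get_transcript_levels` follows this definition, which is not shown in this section; the function name `rpkm` and the numbers are made up for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    # 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling factors.
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical gene: 500 reads on a 2,000 bp gene in a library of 5 million mapped reads.
print(rpkm(500, 2000, 5_000_000))  # 50.0
```

Because RPKM normalizes for both gene length and library size, values are comparable across the 100 samples even though each sample has a different number of reads.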

In [15]:
from rna_tools import *

def get_all_expression():
    ''' Function to iterate through all RNA files to get expression levels of genes
    '''
    # 5 genes x 100 individuals
    individual_expression_matrix = np.zeros((5,100), dtype=float)

    # Samples are numbered 1-100, so sample i is stored in column i - 1
    for i in range(1, 101):
        # Run our get_transcript_levels for all individuals
        individual_expression = get_transcript_levels('data/RNA/sample_RNA_reads_P' + str(i) + '.bam', '../class_10/data/sample_genomic.gff')
        for j, data in enumerate(individual_expression):
            seqid, start, end, expression = data
            individual_expression_matrix[j][i-1] = expression # Just save the expression value
        
    return individual_expression_matrix

In [16]:
# This takes a few minutes to run
bulk_expression = get_all_expression()

## eQTL Analysis

As discussed in the class materials, the variants that we have identified from our diploid samples represent polymorphic locations in the sample genomes. These sites can sometimes affect the regulation or expression level of a gene, often through disruption of transcription factor binding. Identifying variants associated with this change in expression is thus of great interest in understanding how the gene expression might be controlled. To do this, we use eQTL analysis.

Standard eQTL analysis involves direct pairwise testing between genetic markers (here SNPs) and the gene expression levels across many (here 100) individuals. This can be done for all SNPs vs. all genes, but this results in a large penalty when accounting for multiple testing. To avoid this, the analysis is often restricted to a window of ~1 megabase around each gene. For our example today, we can just test all of the variants because our genome is quite small.
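The multiple-testing penalty can be made concrete with a Bonferroni correction, one common and conservative way to account for it. The sketch below is illustrative only: the SNP count of 40 is a made-up number, and `bonferroni_threshold` is a hypothetical helper rather than part of the class code.

```python
def bonferroni_threshold(alpha, n_snps, n_genes):
    """Per-test significance threshold when every SNP is tested against every gene."""
    n_tests = n_snps * n_genes
    return alpha / n_tests

# Hypothetical example: 40 common SNPs tested against 5 genes gives 200 tests.
threshold = bonferroni_threshold(0.01, n_snps=40, n_genes=5)
print(f"{threshold:.1e}")  # 5.0e-05
```

Restricting the tests to SNPs near each gene reduces the number of tests, which is why the windowed approach keeps the corrected threshold less severe.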

Typical eQTL analysis uses a generalized linear model (GLM). For simplicity, we will use a simple linear model of the form $Y_{i} = b_{0} + b_{1}X_{i} + \epsilon_{i}$ where $Y_{i}$ is the RPKM for individual $i$ and $X_{i}$ indicates the genotype (0, 1, 2) at a given SNP. We can implement this with the `stats.linregress()` function in scipy ( https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.linregress.html ).
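A minimal sketch of the regression on made-up data may help before applying it to the real matrices. The genotype and RPKM values below are invented so that each extra alternative allele raises expression by roughly 5 RPKM; the five-value unpacking matches the return order of `stats.linregress`.

```python
import numpy as np
from scipy import stats

# Toy data (hypothetical): genotypes at one SNP and RPKM of one gene for 9 individuals.
genotypes = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
expression = np.array([10.1, 9.8, 10.3, 14.9, 15.2, 15.0, 20.1, 19.7, 20.2])

# Fit Y = b0 + b1 * X; the slope b1 is the expression change per alternative allele.
slope, intercept, r_value, p_value, std_err = stats.linregress(genotypes, expression)

print(f"slope={slope:.2f} intercept={intercept:.2f}")
print("significant" if p_value < 0.01 else "not significant")
```

The p-value tests the null hypothesis that the slope is zero, so a small p-value means the genotype is associated with expression in this sample.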

We can also plot these associations using `matplotlib.pyplot` ( https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html ).

<img src = "figures/eQTL_fig.png">

Today, generate the variant and gene associations that are significant ($p < .01$) and plot these below:

In [18]:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

def eQTL_analysis(bulk_genotypes, bulk_expression):
    ''' Function to perform eQTL analysis
    
    Args:
        bulk_genotypes (2D np array): genotypes for each variant for each individual
        bulk_expression (2D np array): expression values for each gene for each individual
       
    Prints:
        significant eQTLs (p < .01) and plots them as above
    '''
    
    for gene_index in range(bulk_expression.shape[0]):
        # Expression values of this gene across all individuals
        y = np.array(bulk_expression[gene_index])
        
        for snp_index in range(bulk_genotypes.shape[0]):
            # Genotypes (0, 1, 2) at this SNP across all individuals
            xi = np.array(bulk_genotypes[snp_index])
            slope, intercept, r_value, p_value, std_err = stats.linregress(xi, y)
            
            if p_value < 0.01:
                print("significant eQTL: ", "gene", gene_index, "snp", snp_index, "p_value", p_value)
                # Fitted regression line
                line = slope * xi + intercept
                
                plt.plot(xi, line, "r-", xi, y, "o")
                plt.show()

In [23]:
eQTL_analysis(bulk_genotypes, bulk_expression)