# homework 07: Wiggins' lost labels
* Eric Yang
* 10/25/20

Goal: recover the missing labels on Wiggins' files using differential gene expression analysis (DGEA) with edgeR, and assess Wiggins' result of 2246 significantly differentially expressed genes.

In [36]:
import pandas as pd
import numpy as np

## 1. write a python function to run an external edgeR analysis

In [33]:
# runs edgeR on given counts file, returns gene_names, log_fold_changes, log_CPM, pvalues, FDRs
def run_edgeR(counts_file, outfile_name):
    # write R file
    with open('analyze.R','w') as outfile:
        outfile.write('library(edgeR)' + '\n')
        outfile.write('infile <- ' + '"' + counts_file + '"'+ '\n')
        outfile.write('group <- factor(c(1,1,1,2,2,2))' + '\n')
        outfile.write('outfile <-' + '"' + outfile_name + '"' + '\n')
        outfile.write('x <- read.table(infile, sep="\t", row.names=1)' + '\n')
        outfile.write('y <- DGEList(counts=x,group=group)' + '\n')
        outfile.write('y <- estimateDisp(y)' + '\n')
        outfile.write('et <- exactTest(y)' + '\n')
        outfile.write('tab <- topTags(et, nrow(x))' + '\n')
        outfile.write('write.table(tab, file=outfile)' + '\n')
    
    # run R script
    ! Rscript analyze.R
    
    # parse output file from edgeR
    # raw string so that \s is passed to the regex engine unchanged
    data = pd.read_table(outfile_name, sep=r'\s+')
    gene_names = data.index
    log_FC = data['logFC'].values
    log_CPM = data['logCPM'].values
    pvalues = data['PValue'].values
    FDRs = data['FDR'].values
    
    return gene_names, log_FC, log_CPM, pvalues, FDRs
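The script-writing step above can also be sketched as a single helper that returns the R source as a string, which makes it easy to inspect before running. The name `build_edger_script` and the `normalize` flag are hypothetical, introduced here only for illustration; the R calls themselves are the ones already used in `run_edgeR`.

```python
def build_edger_script(counts_file, outfile_name, normalize=False):
    """Return the text of the edgeR script (hypothetical helper, not part of edgeR)."""
    lines = [
        'library(edgeR)',
        f'infile <- "{counts_file}"',
        'group <- factor(c(1,1,1,2,2,2))',
        f'outfile <- "{outfile_name}"',
        # '\\t' keeps the backslash, so R sees the escape "\t" instead of a raw tab
        'x <- read.table(infile, sep="\\t", row.names=1)',
        'y <- DGEList(counts=x, group=group)',
    ]
    if normalize:
        # optional TMM normalization step
        lines.append('y <- calcNormFactors(y)')
    lines += [
        'y <- estimateDisp(y)',
        'et <- exactTest(y)',
        'tab <- topTags(et, nrow(x))',
        'write.table(tab, file=outfile)',
    ]
    return '\n'.join(lines) + '\n'


script = build_edger_script('merged.12', 'result12.out', normalize=True)
print(script)
```

Outside a notebook, where the `!` shell escape is not available, the returned text could be written to `analyze.R` and run with `subprocess.run(['Rscript', 'analyze.R'], check=True)`.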

## 2. reproduce Wiggins' data, assign the missing labels

In [34]:
# merge data sets
# merge 1 and 2
! join -t $'\t' w07-data.1 w07-data.2 > merged.12
# merge 1 and 3
! join -t $'\t' w07-data.1 w07-data.3 > merged.13
# merge 2 and 3
! join -t $'\t' w07-data.2 w07-data.3 > merged.23
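The shell `join` command expects both inputs to be sorted on the join field. As a hedged alternative, the same merge can be sketched in pandas, where input order does not matter. The example uses small in-memory stand-ins with invented counts, and it assumes the real files are tab-delimited with the gene name in the first column and no header row; the column names `s1`...`s6` are made up for the sketch.

```python
import io
import pandas as pd

# Toy stand-ins for two count files (invented values): gene name, then three sample counts
file_a = io.StringIO("geneA\t10\t12\t11\ngeneB\t0\t1\t2\ngeneC\t5\t5\t6\n")
file_b = io.StringIO("geneC\t7\t8\t9\ngeneA\t20\t22\t21\n")

a = pd.read_csv(file_a, sep="\t", header=None, index_col=0,
                names=["gene", "s1", "s2", "s3"])
b = pd.read_csv(file_b, sep="\t", header=None, index_col=0,
                names=["gene", "s4", "s5", "s6"])

# Inner join on the gene-name index: keep only genes present in both files
merged = a.join(b, how="inner")
print(merged)
```

Writing the result with `merged.to_csv("merged.12", sep="\t", header=False)` would produce a file in the same layout that `read.table(infile, sep="\t", row.names=1)` reads.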

In [35]:
# run edgeR on merged data
gene_names12, log_FC12, log_CPM12, pvalues12, FDRs12 = run_edgeR('merged.12','result12.out')
gene_names13, log_FC13, log_CPM13, pvalues13, FDRs13 = run_edgeR('merged.13','result13.out')
gene_names23, log_FC23, log_CPM23, pvalues23, FDRs23 = run_edgeR('merged.23','result23.out')

Loading required package: limma
Design matrix not provided. Switch to the classic mode.




Loading required package: limma
Design matrix not provided. Switch to the classic mode.
Loading required package: limma
Design matrix not provided. Switch to the classic mode.


Which combination did Wiggins run to obtain his result of 2246 differentially expressed genes significant at P<0.05?

In [38]:
# given p-value threshold = 0.05, count number of p-values less than 0.05 for each merged dataset
thresh = 0.05
r12 = np.where(pvalues12 < thresh)
print("merged12: ", len(r12[0]))

r13 = np.where(pvalues13 < thresh)
print("merged13: ", len(r13[0]))

r23 = np.where(pvalues23 < thresh)
print("merged23: ", len(r23[0]))

merged12:  2246
merged13:  988
merged23:  2312


Wiggins obtained his result of 2246 differentially expressed genes from the combination of w07-data.1 and w07-data.2.

Which of the three files corresponds to the mutant sand mouse samples? Why?

Inferring from Wiggins' incorrect statistical procedure, it looks like w07-data.2 corresponds to the mutant sand mouse samples. Far fewer p-values meet Wiggins' threshold of p < 0.05 in the merged 1 and 3 dataset (988) than in the other two merged combinations (2246 and 2312). Looking at the p-values alone, we can infer that the expression profiles in dataset 1 and dataset 3 are more similar to one another than either is to dataset 2, suggesting that the samples in dataset 2 come from the mutant. However, this analysis is incomplete, as we will explore further in parts 3 and 4.

## 3. Wiggins doesn't understand p-values

I do not agree with Wiggins' conclusion that 2246 genes are differentially expressed in the wt vs mutant comparison. First, it is unclear why he selected 2246 instead of the 2312 obtained by comparing datasets 2 and 3. But the bigger problem is that Wiggins did not correct for the multiple hypothesis tests performed by edgeR. Each p-value in the output is the probability of observing a result at least as extreme as the one seen, given that there is no expression difference **for that gene**. As we have learned, if we apply a single p-value threshold to all the tests, we end up with many false positives simply due to chance, especially when the number of tests is high. There are many ways to correct for the multiple testing problem. Instead of lowering the p-value threshold (Bonferroni), which is often too stringent, edgeR reports false discovery rates (FDR) computed with the Benjamini-Hochberg procedure, which Wiggins failed to use. By thresholding on the FDR, we control the expected proportion of false positives among the genes we call significant, so we can state more confidently how many genes are truly differentially expressed.
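To make the correction concrete, here is a minimal NumPy sketch of the Benjamini-Hochberg adjustment. It is an illustration under stated assumptions, not edgeR's own code: the function name `benjamini_hochberg` and the five p-values are invented for the example.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (FDR estimates) in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # scale the i-th smallest p-value by m / i
    ranked = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity: the value at rank i is the minimum over ranks >= i
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# Invented p-values for five hypothetical genes
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.20])
fdr = benjamini_hochberg(pvals)
print("p < 0.05:  ", int(np.sum(pvals < 0.05)))
print("FDR < 0.05:", int(np.sum(fdr < 0.05)))
```

In this toy case four genes pass the raw p < 0.05 cutoff but only two survive at FDR < 0.05, the same kind of shrinkage seen when moving from Wiggins' count to an FDR-based count.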

In [62]:
# FDR threshold to determine appropriate statistical cutoff
FDR_thresh = 0.05
# topTags sorts rows by p-value, so the last significant row holds the largest accepted p-value
idx12 = np.where(FDRs12 < FDR_thresh)
print("# significantly expressed 1 vs 2: ", len(idx12[0])) 
print("new p-value cutoff 1 vs 2: ", pvalues12[np.max(idx12)])
idx13 = np.where(FDRs13 < FDR_thresh)
print("# significantly expressed 1 vs 3: ", len(idx13[0])) 
idx23 = np.where(FDRs23 < FDR_thresh)
print("# significantly expressed 2 vs 3: ", len(idx23[0])) 
print("new p-value cutoff 2 vs 3: ", pvalues23[np.max(idx23)])

# significantly expressed 1 vs 2:  74
new p-value cutoff 1 vs 2:  0.00017823136576347
# significantly expressed 1 vs 3:  0
# significantly expressed 2 vs 3:  70
new p-value cutoff 2 vs 3:  0.00016714395691623402


The more appropriate statistical cutoff here is an FDR cutoff of 0.05, a common convention, rather than a universal p-value threshold. At this threshold there are 74 differentially expressed genes between samples 1 and 2, 0 between samples 1 and 3, and 70 between samples 2 and 3. The p-values printed above are the largest p-values still accepted in each comparison. These counts are far smaller than 2246 because the FDR cutoff is much stricter than a plain p<0.05. We can now expect most of our discoveries to be true positives, and the absence of any significant genes between samples 1 and 3 supports our labeling of sample 2 as the mutant.

## 4. Wiggins missed something else too

Looking at Wiggins' edgeR pipeline, it looks like Wiggins forgot to normalize for indirect effects on relative abundance. As we have learned in the course, RNA-seq experiments give us relative abundances, which is a zero-sum game. For example, say there are two genes A and B that are mapped and compared between an experimental condition and a control. If gene A is overexpressed in the experimental condition while the absolute expression of gene B is unchanged, the relative abundance of gene B will still be lower in the experimental condition, because B now makes up a smaller proportion of all the reads in that sample. If we do not account for this, we would mistakenly conclude that gene B is underexpressed in the experimental condition. edgeR has a step called calcNormFactors that implements the trimmed mean of M-values (TMM) method to deal with this. TMM trims away the genes with the most extreme log fold changes (M-values) and abundances, then uses a weighted mean of the remaining M-values to estimate how much of this relative abundance shift occurred in each sample and corrects for it.
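The zero-sum effect can be illustrated with a small numeric sketch. All numbers are invented, and the median-based correction at the end is only a crude stand-in for TMM, which uses a trimmed, weighted mean rather than a median.

```python
import numpy as np

# Hypothetical absolute transcript counts for genes A, B, C (invented numbers)
control = np.array([100.0, 100.0, 800.0])
treated = np.array([1100.0, 100.0, 800.0])  # only gene A truly changes

# Sequencing samples a fixed depth, so we observe proportions, not absolute counts
depth = 1_000_000
control_reads = depth * control / control.sum()
treated_reads = depth * treated / treated.sum()

# Naive log2 fold changes computed from the observed reads
log2_fc = np.log2(treated_reads / control_reads)
print(dict(zip("ABC", np.round(log2_fc, 2))))

# Unchanged genes B and C share the same spurious shift; subtracting the
# median shift (a simplification of the TMM idea) removes it
shift = np.median(log2_fc)
corrected = log2_fc - shift
print(dict(zip("ABC", np.round(corrected, 2))))
```

Before correction, genes B and C appear to be halved even though their absolute counts did not change; after subtracting the shared shift, only gene A shows a fold change.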

In [63]:
# add calcNormFactors in pipeline to correct for relative abundance
# runs edgeR on given counts file, returns gene_names, log_fold_changes, log_CPM, pvalues, FDRs
def run_edgeR_norm(counts_file, outfile_name):
    # write R file
    with open('analyze.R','w') as outfile:
        outfile.write('library(edgeR)' + '\n')
        outfile.write('infile <- ' + '"' + counts_file + '"'+ '\n')
        outfile.write('group <- factor(c(1,1,1,2,2,2))' + '\n')
        outfile.write('outfile <-' + '"' + outfile_name + '"' + '\n')
        outfile.write('x <- read.table(infile, sep="\t", row.names=1)' + '\n')
        outfile.write('y <- DGEList(counts=x,group=group)' + '\n')
        outfile.write('y <- calcNormFactors(y)' + '\n')
        outfile.write('y <- estimateDisp(y)' + '\n')
        outfile.write('et <- exactTest(y)' + '\n')
        outfile.write('tab <- topTags(et, nrow(x))' + '\n')
        outfile.write('write.table(tab, file=outfile)' + '\n')
    
    # run R script
    ! Rscript analyze.R
    
    # parse output file from edgeR
    # raw string so that \s is passed to the regex engine unchanged
    data = pd.read_table(outfile_name, sep=r'\s+')
    gene_names = data.index
    log_FC = data['logFC'].values
    log_CPM = data['logCPM'].values
    pvalues = data['PValue'].values
    FDRs = data['FDR'].values
    
    return gene_names, log_FC, log_CPM, pvalues, FDRs

In [68]:
# run corrected edgeR pipeline on merged data
gene_names12_norm, log_FC12_norm, log_CPM12_norm, pvalues12_norm, FDRs12_norm = run_edgeR_norm('merged.12','result12_norm.out')
gene_names13_norm, log_FC13_norm, log_CPM13_norm, pvalues13_norm, FDRs13_norm = run_edgeR_norm('merged.13','result13_norm.out')
gene_names23_norm, log_FC23_norm, log_CPM23_norm, pvalues23_norm, FDRs23_norm = run_edgeR_norm('merged.23','result23_norm.out')

Loading required package: limma
Design matrix not provided. Switch to the classic mode.




Loading required package: limma
Design matrix not provided. Switch to the classic mode.
Loading required package: limma
Design matrix not provided. Switch to the classic mode.


In [69]:
# FDR threshold to determine appropriate statistical cutoff
FDR_thresh = 0.05
# topTags sorts rows by p-value, so the last significant row holds the largest accepted p-value
idx12_norm = np.where(FDRs12_norm < FDR_thresh)
print("# significantly expressed 1 vs 2: ", len(idx12_norm[0])) 
print("new p-value cutoff 1 vs 2: ", pvalues12_norm[np.max(idx12_norm)])
idx13_norm = np.where(FDRs13_norm < FDR_thresh)
print("# significantly expressed 1 vs 3: ", len(idx13_norm[0])) 
idx23_norm = np.where(FDRs23_norm < FDR_thresh)
print("# significantly expressed 2 vs 3: ", len(idx23_norm[0])) 
print("new p-value cutoff 2 vs 3: ", pvalues23_norm[np.max(idx23_norm)])

# significantly expressed 1 vs 2:  53
new p-value cutoff 1 vs 2:  0.000118446557859175
# significantly expressed 1 vs 3:  0
# significantly expressed 2 vs 3:  51
new p-value cutoff 2 vs 3:  5.31938828059751e-05


With an FDR threshold of 0.05 and after accounting for relative abundance, there are now 53 differentially expressed genes between samples 1 and 2, 0 between samples 1 and 3, and 51 between samples 2 and 3. The p-values printed above are the largest p-values still accepted in each comparison. The new counts are lower than the ones obtained without accounting for relative abundance (74 and 70). This makes sense, as we eliminate the "discoveries" that were caused by inadvertent shifts in relative abundance. With this complete pipeline, we can be even more confident that most of our discoveries are true positives.

Making additional connections to biology, it is of note that biological data is often noisy and even though both samples 1 and 3 are considered wild types, comparing each of them to the mutant sample 2 yields slightly different results. Comparing the mutant to sample 1 suggests that there are 53 differentially expressed genes while there are only 51 comparing the mutant to sample 3. Here are some exploratory analyses:

In [76]:
# differentially expressed genes concluded in both 1 vs 2 and 2 vs 3
len(list(set(gene_names12_norm[0:53])&set(gene_names23_norm[0:51])))

50

There are 50 genes that we conclude are differentially expressed in the mutant compared to both of the wild type samples. 

In [78]:
# compare differences
list(set(gene_names12_norm[0:53])-set(gene_names23_norm[0:51]))

['ACOXL', 'DOK5', 'ITPR1']

The 'ACOXL', 'DOK5', and 'ITPR1' genes were called differentially expressed when comparing the mutant sample to sample 1, but not when comparing it to sample 3. Notably, these are the last three genes accepted in the 1 vs 2 comparison, with FDR values just under the FDR<0.05 threshold.

In [82]:
# compare differences in the other direction
list(set(gene_names23_norm[0:51])-set(gene_names12_norm[0:53]))

['AKAP8L']

The 'AKAP8L' gene was called differentially expressed when comparing the mutant sample to sample 3, but not when comparing it to sample 1. Again, this is the last gene accepted in that comparison, with an FDR value just under the FDR<0.05 threshold. All four genes that belong to only one of the two comparisons sit right on the edge of the cutoff, potentially because of technical or biological noise, or even the batch effects commonly found in RNA-seq data. Compared to the 50 genes found in both comparisons, I am less confident in concluding that these 4 genes are differentially expressed in the mutant compared to the wild type.
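The comparisons above slice the sorted gene lists with hardcoded counts (`[0:53]`, `[0:51]`). A hedged alternative is to select genes directly by their FDR values, so the sets stay correct if the threshold or the data change. The helper name `significant_genes` and the toy inputs below are invented for illustration.

```python
import numpy as np
import pandas as pd

def significant_genes(gene_names, fdrs, thresh=0.05):
    """Return the set of gene names whose FDR is below thresh (hypothetical helper)."""
    return set(np.asarray(gene_names)[np.asarray(fdrs) < thresh])

# Toy inputs (invented values) standing in for two run_edgeR_norm results
names_a = pd.Index(["g1", "g2", "g3", "g4"])
fdrs_a = np.array([0.001, 0.01, 0.049, 0.30])
names_b = pd.Index(["g2", "g1", "g5", "g3"])
fdrs_b = np.array([0.002, 0.02, 0.04, 0.06])

sig_a = significant_genes(names_a, fdrs_a)
sig_b = significant_genes(names_b, fdrs_b)
print("shared:", sorted(sig_a & sig_b))
print("only in a:", sorted(sig_a - sig_b))
print("only in b:", sorted(sig_b - sig_a))
```

With the real results, the calls would be along the lines of `significant_genes(gene_names12_norm, FDRs12_norm)` and `significant_genes(gene_names23_norm, FDRs23_norm)`.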