Skip to content
Reproducing the MOUTHWASH simulations from Gerard and Stephens (2017)
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
Code
Data
Output
R
.gitignore
LICENSE
Makefile
NOTES.txt
README.Rmd
README.md
mouthwash.sbatch
readme_bib.bib
test.backwash.R

README.md

Source to reproduce results from Gerard and Stephens (2017)

DOI

This repository contains source code to reproduce the empirical evaluations of Gerard and Stephens (2017). The new methods can be found in the vicar package.

If you find a bug, please create an issue.

Citing this work

If you find any of the source code in this repository useful for your work, please cite our paper:

Gerard, David, and Matthew Stephens. 2017. "Empirical Bayes Shrinkage and False Discovery Rate Estimation, Allowing for Unwanted Variation." arXiv Preprint arXiv:1709.10066. https://arxiv.org/abs/1709.10066.

License

Copyright (c) 2017-2018, David Gerard.

All source code and software in this repository are made available under the terms of the GNU General Public License. See the LICENSE file for the full text of the license.

Instructions

To reproduce the results of Gerard and Stephens (2017), you need to (i) install the appropriate R packages, (ii) obtain the appropriate data, (iii) run make and (iv) get some coffee while you wait. Please read below for details on each of these steps.

Install software and R packages

  1. Install R.
  2. Install GNU Make.
  3. Install the required R packages by running the following commands in the R interactive environment. (Note the order of these commands is important---the Bioconductor packages should be installed before the CRAN packages.)
source("https://bioconductor.org/biocLite.R")
biocLite(c("sva", "limma", "qvalue", "biomaRt"), suppressUpdates = TRUE)
install.packages(c("tidyverse", "stringr", "reshape2", "pROC",
                   "ruv", "cate", "devtools", "ashr", "bfa",
                   "xtable", "dplyr", "ggthemes",
                   "assertthat", "R.utils"))
devtools::install_github("dcgerard/seqgendiff")
devtools::install_github("dcgerard/vicar")

If the newer versions of seqgendiff and vicar do not work, then you can try the specific commits I used for publication:

devtools::install_github("dcgerard/vicar", ref = "41fc3df8faed8d59af1f29410619908385970124")
devtools::install_github("dcgerard/seqgendiff", ref = "680e088c1a37879b82b7db86b24fbd657b5f6869")

See below for the versions of the packages that I used for the paper.

Get data

Download the following files from the GTEx Portal and copy them to the Data subdirectory. In order to replicate my results, it is important to use version "6p" of the GTEx data. To access these files, you will need to register for a GTEx Portal account if you have not done so already.

  1. GTEx_Data_V6_Annotations_SampleAttributesDS.txt

  2. GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_reads.gct.gz

  3. gencode.v19.genes.v6p_model.patched_contigs.gtf.gz

  4. GTEx_Data_V6_Annotations_SubjectPhenotypesDS.txt

    Next, download the list of human housekeeping genes from Eisenberg and Levanon (2013) and the NCBI-to-Ensembl gene mapping file, and copy these files to the Data directory:

  5. http://www.tau.ac.il/~elieis/HKG/HK_genes.txt

  6. ftp://ftp.ncbi.nih.gov/gene/DATA/gene2ensembl.gz

    Finally, download the list of human housekeeping genes from Lin et al. (2017) by navigating to their Shiny R application, clicking on "Default Values", then immediately clicking "Download". Then place this file, labeled "h-scHKgenes.csv", in the Data directory.

  7. h-scHKgenes.csv

    I used the file get_ensembl.R to correspond HGNC gene names to their ensembl annotation, but I don't run this in make because I don't want a dependency on the biomaRt package, so I've included the resulting file lin_hk_genes.csv in the Data directory.

After completing these steps, the Data folder should look like this:

cd Data
ls -1
## gencode.v19.genes.v6p_model.patched_contigs.gtf.gz
## gene2ensembl.gz
## GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_reads.gct.gz
## GTEx_Data_V6_Annotations_SampleAttributesDS.txt
## GTEx_Data_V6_Annotations_SubjectPhenotypesDS.txt
## HK_genes.txt
## h-scHKgenes.csv
## lin_hk_genes.csv

Run make

To reproduce all of the results in Gerard and Stephens (2017), move to the root directory in your local copy of this repository, and run make from the terminal. This will run all the steps in the data processing and analysis pipeline. Note that you may need to adjust some of the settings in the Makefile to suit your computing environment.

Since some of the steps take a long time to run, you may not want to run everything at once. If you want to reproduce the simulation results, run

make sims

If you want to reproduce just the results of the GTEx analysis using the control genes from Eisenberg and Levanon (2013), run

make gtex_analysis

If you want to reproduce just the results of the GTEx analysis using the control genes from Lin et al. (2017), run

make gtex_analysis_lin

If you want to reproduce the computation time calculations, run

make computation

If you want to reproduce the figure in the introduction, run

make one_data

At any point during or after the commands are running, it may be helpful to inspect the .Rout files saved in the Output subdirectory.

Get Coffee

All of these runs (except the last one) should take a very long time (a day to a couple of days). You should get some coffee. Here is a list of some of my favorite places:

If you are having trouble reproducing these results, it might be that you need to update some of your R packages. These are the versions that I used (including some versions of packages that are not actually needed to run the code):

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] vicar_0.1-9       seqgendiff_0.1.0  biomaRt_2.26.1   
##  [4] qvalue_2.2.2      limma_3.26.9      sva_3.18.0       
##  [7] genefilter_1.52.1 mgcv_1.8-23       nlme_3.1-137     
## [10] R.utils_2.6.0     R.oo_1.22.0       R.methodsS3_1.7.1
## [13] assertthat_0.2.0  ggthemes_3.5.0    xtable_1.8-2     
## [16] bfa_0.4           ashr_2.2-7        devtools_1.13.5  
## [19] cate_1.0.4        ruv_0.9.7         pROC_1.12.1      
## [22] reshape2_1.4.3    forcats_0.3.0     stringr_1.3.1    
## [25] dplyr_0.7.4       purrr_0.2.4       readr_1.1.1      
## [28] tidyr_0.8.0       tibble_1.4.2      ggplot2_2.2.1    
## [31] tidyverse_1.2.1  
## 
## loaded via a namespace (and not attached):
##  [1] svd_0.4.1            bitops_1.0-6         lubridate_1.7.4     
##  [4] bit64_0.9-7          doParallel_1.0.11    httr_1.3.1          
##  [7] rprojroot_1.3-2      tools_3.5.0          backports_1.1.2     
## [10] R6_2.2.2             DBI_1.0.0            lazyeval_0.2.1      
## [13] BiocGenerics_0.16.1  colorspace_1.3-2     withr_2.1.2         
## [16] gridExtra_2.3        mnormt_1.5-5         bit_1.1-12          
## [19] compiler_3.5.0       cli_1.0.0            rvest_0.3.2         
## [22] Biobase_2.30.0       xml2_1.2.0           scales_0.5.0        
## [25] SQUAREM_2017.10-1    psych_1.8.4          esaBcv_1.2.1        
## [28] digest_0.6.15        foreign_0.8-70       rmarkdown_1.9       
## [31] pscl_1.5.2           pkgconfig_2.0.1      htmltools_0.3.6     
## [34] rlang_0.2.0          readxl_1.1.0         rstudioapi_0.7      
## [37] RSQLite_2.1.1        bindr_0.1.1          jsonlite_1.5        
## [40] leapp_1.2            RCurl_1.95-4.10      magrittr_1.5        
## [43] Matrix_1.2-14        Rcpp_0.12.16         munsell_0.4.3       
## [46] S4Vectors_0.8.11     stringi_1.2.2        yaml_2.1.19         
## [49] MASS_7.3-50          plyr_1.8.4           grid_3.5.0          
## [52] blob_1.1.1           parallel_3.5.0       crayon_1.3.4        
## [55] lattice_0.20-35      haven_1.1.1          splines_3.5.0       
## [58] annotate_1.48.0      hms_0.4.2            knitr_1.20          
## [61] pillar_1.2.2         corpcor_1.6.9        codetools_0.2-15    
## [64] stats4_3.5.0         XML_3.98-1.11        glue_1.2.0          
## [67] evaluate_0.10.1      modelr_0.1.2         foreach_1.4.4       
## [70] cellranger_1.1.0     gtable_0.2.0         broom_0.4.4         
## [73] coda_0.19-1          survival_2.42-3      truncnorm_1.0-8     
## [76] iterators_1.0.9      AnnotationDbi_1.32.3 memoise_1.1.0       
## [79] IRanges_2.4.8        bindrcpp_0.2.2

As you can see, I've also only tried this out on Ubuntu.

Credits

This project was developed by David Gerard at the University of Chicago.

Peter Carbonetto made many fantastic contributions and suggestions to increase the reproducibility of this work.

Thanks to Matthew Stephens for his support and mentorship.

References

Eisenberg, Eli, and Erez Y Levanon. 2013. “Human Housekeeping Genes, Revisited.” Trends in Genetics 29 (10). Elsevier: 569–74. doi:10.1016/j.tig.2013.05.010.

Gerard, David, and Matthew Stephens. 2017. “Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls.” arXiv Preprint arXiv:1705.08393. https://arxiv.org/abs/1705.08393.

Lin, Yingxin, Shila Ghazanfar, Dario Strbenac, Andy Wang, Ellis Patrick, Terence Speed, Jean Yang, and Pengyi Yang. 2017. “Housekeeping Genes, Revisited at the Single-Cell Level.” bioRxiv. Cold Spring Harbor Laboratory. doi:10.1101/229815.

You can’t perform that action at this time.