GitHub - dclarkboucher/mediation

This repository contains the code for implementing our analysis from the manuscript
"Methods for Mediation Analysis with High-Dimensional DNA Methylation Data:
Possible Choices and Comparison"

R version should be >= 2.1.0. Details on the packages required are located
in the relevant scripts. When running R code, make sure that your working
directory is the directory where this ReadMe is stored. In RStudio, you can
do this by loading the "mediation_DNAm.Rproj" R project.
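
As a quick sanity check before running any script, something like the following can confirm that the working directory is the project root. This is only a sketch: the helper name "in_project_root" is hypothetical and not part of the repository, and it simply looks for the "mediation_DNAm.Rproj" file mentioned above.

```r
# Hypothetical helper (not part of the repository): returns TRUE when `dir`
# contains the R project file that sits next to this ReadMe.
in_project_root <- function(dir = getwd(), marker = "mediation_DNAm.Rproj") {
  file.exists(file.path(dir, marker))
}

if (!in_project_root()) {
  message("Working directory does not look like the project root: ", getwd())
}
```

If the message appears, use setwd() to move to the directory containing this ReadMe, or open the R project in RStudio.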

If you have questions about the code in this repository or issues implementing the
analysis, contact me at dclarkboucher@fas.harvard.edu. 

SIMULATION STUDY:
The first step in the simulation study is to generate the necessary data, which 
can be done with the R script "simulation_scripts/generate_data.R". When using 
this script, set the "ndat" parameter to be 100 to replicate the entire study 
(extremely computationally costly), or 1 to produce results for just a single 
simulated dataset in each setting (24 datasets in total). 

The second step is to implement the R scripts "simulation_scripts/pathway_lasso.R", 
"simulation_scripts/hima_hdma_medfix_pcma_hilma.R", "simulation_scripts/one-at-a-time.R", 
and "simulation_scripts/bslmm.R". Using these scripts requires installation of 
additional R packages from CRAN and GitHub. Moreover, because the methods vary in 
run time and there are many simulated datasets, we strongly recommend using parallel 
computing on a remote cluster, for which our scripts can be easily adapted.
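
One possible way to adapt the scripts to a cluster is to give each array job a single (setting, dataset) pair. The sketch below is an illustration under stated assumptions, not the repository's own code: it assumes 24 settings (from the "24 datasets in total" noted above for ndat = 1), it assumes the cluster uses the SLURM scheduler, which exposes the SLURM_ARRAY_TASK_ID environment variable, and the helper names "task_grid" and "task_for_id" are hypothetical.

```r
# Enumerate every (setting, dataset) combination. With 24 settings and
# ndat = 100 this gives 2400 tasks; with ndat = 1 it gives 24.
task_grid <- function(ndat, n_settings = 24) {
  expand.grid(dataset = seq_len(ndat), setting = seq_len(n_settings))
}

# Look up the combination that belongs to one array task.
task_for_id <- function(task_id, ndat, n_settings = 24) {
  grid <- task_grid(ndat, n_settings)
  stopifnot(task_id >= 1, task_id <= nrow(grid))
  grid[task_id, ]
}

# Assumption: running under a SLURM job array. Falls back to task 1 when the
# variable is not set, for example in an interactive session.
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))
task <- task_for_id(task_id, ndat = 100)
message("Task ", task_id, ": setting ", task$setting, ", dataset ", task$dataset)
```

Each method script could then read only the dataset identified by "task" instead of looping over all of them. How the datasets are named on disk depends on "generate_data.R" and is not assumed here.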

The third step is to run "simulation_scripts/true_positive_rate_mse.R", which 
calculates the true positive rates for detecting active mediators and the MSE for 
estimating mediation contributions, and "simulation_scripts/percent_relative_bias.R", 
which calculates the percent relative bias in estimating the total indirect effect. 
The output of these scripts was used directly to make manuscript tables 3-6 and
supplementary tables 1-4. 
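
For readers unfamiliar with the three metrics, the sketch below shows their usual textbook definitions in base R. It is not taken from the repository's scripts, the function names are hypothetical, and the manuscript's exact definitions (for example, how results are averaged across datasets) may differ.

```r
# True positive rate: share of truly active mediators that a method selected.
# `selected` and `active` are logical vectors with one entry per mediator.
true_positive_rate <- function(selected, active) {
  stopifnot(length(selected) == length(active), any(active))
  sum(selected & active) / sum(active)
}

# Mean squared error of estimated mediation contributions against the truth.
mse <- function(estimate, truth) {
  mean((estimate - truth)^2)
}

# Percent relative bias of repeated estimates of a single nonzero true value,
# such as the total indirect effect.
percent_relative_bias <- function(estimates, truth) {
  stopifnot(length(truth) == 1, truth != 0)
  100 * (mean(estimates) - truth) / truth
}

# Toy inputs, made up purely for illustration.
active   <- c(TRUE, TRUE, FALSE, FALSE)
selected <- c(TRUE, FALSE, TRUE, FALSE)
tpr_example <- true_positive_rate(selected, active)         # 1 of 2 active found
mse_example <- mse(c(1, 2, 3), c(1, 2, 5))                  # (0 + 0 + 4) / 3
prb_example <- percent_relative_bias(c(0.9, 1.1, 1.3), 1)   # mean 1.1 vs truth 1
```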

OBSERVED DNAm DATA ANALYSIS:
Data used in this analysis can be obtained through the MESA Data Coordinating 
Center (https://www.mesanhlbi.org/). Since we cannot make MESA's data publicly 
available, the first step is to run "dnam_scripts/generate_fake_dnam.R", which 
creates a toy DNAm dataset simulated to resemble the observed methylation data. 

The second step is to run "dnam_scripts/fit_single_mediator_models.R", which fits 
linear mixed models to screen the CpG sites down to the subset of 2,000 that 
were used in the analysis. Then run "dnam_scripts/process_single_med_results.R" to 
process the output files. 
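
The general idea of one-at-a-time screening can be illustrated on simulated data. The sketch below is deliberately simplified and is not the repository's procedure: it uses ordinary linear models via lm() rather than linear mixed models, it assumes that CpGs are ranked by the p-value of their association with the exposure (the actual ranking criterion is defined in the scripts above), and it keeps 10 of 50 simulated sites instead of 2,000.

```r
set.seed(1)
n <- 200   # simulated subjects
p <- 50    # simulated CpG sites
exposure <- rnorm(n)
M <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("cpg", seq_len(p))))
M[, 1:3] <- M[, 1:3] + 0.8 * exposure   # three sites truly depend on the exposure

# Fit one model per site and record the p-value for the exposure coefficient.
pvals <- apply(M, 2, function(m) {
  summary(lm(m ~ exposure))$coefficients["exposure", "Pr(>|t|)"]
})

# Keep the sites with the smallest p-values.
n_keep <- 10
keep <- names(sort(pvals))[seq_len(n_keep)]
M_screened <- M[, keep, drop = FALSE]
```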

The third step is to run "dnam_scripts/regress_out_random_effects.R" to regress 
the random effect covariates out of the mediators. This step is needed because none of the 
high-dimensional mediation methods can directly handle random effects as covariates,
whereas a few of them can handle fixed effects.
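
To make "regressing out" a random effect concrete, here is a minimal sketch for a single simulated mediator. It rests on several assumptions: the grouping variable "batch" is hypothetical, the model has only a random intercept, and the nlme package is used for convenience. The repository's script may use a different package and a different set of covariates.

```r
library(nlme)

set.seed(2)
n_batch <- 20
per_batch <- 10
batch_id <- rep(seq_len(n_batch), each = per_batch)
batch_shift <- rnorm(n_batch, sd = 2)   # simulated random intercepts
dat <- data.frame(
  batch = factor(batch_id),
  mediator = batch_shift[batch_id] + rnorm(n_batch * per_batch)
)

# Random-intercept model for the mediator.
fit <- lme(mediator ~ 1, random = ~ 1 | batch, data = dat)

# Level 1 fitted values include the predicted random intercepts, level 0
# fitted values include only the fixed effects, so the difference is the
# estimated random-effect contribution for each observation.
random_part <- as.numeric(fitted(fit, level = 1) - fitted(fit, level = 0))

# Adjusted mediator with the estimated random effect removed.
dat$mediator_adj <- dat$mediator - random_part
```

In a real analysis the same adjustment would be repeated for every retained CpG site, and the adjusted mediators would then be passed to the high-dimensional mediation methods.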

The fourth step is to run our files for implementing the methods. This can be done 
with the master script "dnam_scripts/implement_methods_master.R" which will run 
the many needed subscripts located in the folder "dnam_scripts/implement_methods"; 
or, it can be done by running those subscripts one at a time, which may be
more practical since running them all at once would be quite slow.
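
When running the subscripts individually, a small wrapper such as the one below can keep track of which ones finished. This is a sketch only: "run_scripts" is a hypothetical helper, it assumes the subscripts can be sourced independently of the master script, and it discovers file names with list.files() rather than assuming any particular names.

```r
# Source every .R file in `dir`, recording success or the error message so
# that one failing method does not stop the others.
run_scripts <- function(dir, pattern = "\\.R$") {
  scripts <- sort(list.files(dir, pattern = pattern, full.names = TRUE))
  status <- vapply(scripts, function(path) {
    tryCatch({
      source(path)
      "ok"
    }, error = function(e) paste("failed:", conditionMessage(e)))
  }, character(1))
  data.frame(script = basename(scripts), status = unname(status),
             stringsAsFactors = FALSE)
}

# Only runs when the folder named in this ReadMe is present.
if (dir.exists("dnam_scripts/implement_methods")) {
  print(run_scripts("dnam_scripts/implement_methods"))
}
```

If the subscripts rely on objects created by the master script, those would need to be set up first.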

Once all the methods have been run, the fifth and final step is to run the scripts
"dnam_scripts/identify_noteworthy_cpgs.R", "dnam_scripts/estimate_mediation_effect.R",
and "dnam_scripts/read_hdmm_spcma.R", which produce, respectively, manuscript
table 1 and supplementary file 1; manuscript table 2; and the results necessary
for interpreting SPCMA and HDMM.