/
bootstrap-cluster.Rmd
184 lines (119 loc) · 6.77 KB
/
bootstrap-cluster.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
---
title: "Using HTCondor for bootstrap analysis"
author: "Frederick Boehm"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Using HTCondor for bootstrap analysis}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
## Overview
Performing a bootstrap analysis with `qtl2pleio` requires first sampling a (bivariate) phenotype from the inferred bivariate normal distribution, then a two-dimensional, two-QTL QTL scan over the defined scan region. Often, one wants to include more than 100 markers in a scan region. Each two-dimensional scan can take tens of minutes with a single core. Thus, performing a bootstrap analysis with, say, 1000 bootstrap samples is prohibitively time-consuming on most computers.
For this reason, we used a high-throughput computing cluster at the University of Wisconsin-Madison (where `qtl2pleio` was developed). The UW-Madison Center for High-Throughput Computing (CHTC) is available to all UW-Madison researchers. While we recognize that `qtl2pleio` users may not have access to the CHTC, we anticipate that some will have access to computing clusters. Hopefully this vignette can be adapted to your needs.
## Directory structure, files & setup for using Condor
We placed on github a repository that contains the results from our bootstrap analysis of the Recla data.
Here is the repository's url: https://github.com/fboehm/qtl2pleio-manuscript-chtc/
Within this repo are three subdirectories. We'll examine in detail the "Recla-bootstrap" subdirectory.
Within this directory, we created four subdirectories:
1. Rscript
1. data
1. shell_scripts
1. submit_files
The Github repository also contains a fifth subdirectory, `squid` and a sixth subdirectory `results`, that are included for completeness. In practice, files in `squid` were in a distinct directory. We describe it below.
### Files in `Rscript`
The `Rscript` subdirectory contains a file, `boot-Recla-10-22.R` that contains R code for our bootstrap analysis.
```{r}
fn <- file.path("https://raw.githubusercontent.com",
"fboehm/qtl2pleio-manuscript/master/chtc",
"Recla-bootstrap/Rscript/boot-Recla-10-22.R"
)
foo <- readLines(fn)
```
Here we display the contents of our `boot-Recla-10-22.R` file.
```{r, echo = FALSE, comment = ""}
cat(foo, sep = "\n")
```
### Files in `data`
The data subdirectory contains three rds files, one for each of the three input components.
1. founder allele dosages for Chr 8
2. phenotypes
3. kinship matrix (derived via LOCO method)
### Files in `shell_scripts`
We examine only the file that is needed for the Recla analysis, `boot-Recla-10-22.sh`.
Here is the contents of the file.
```{bash, eval = FALSE}
#!/bin/bash
# untar your R installation
tar -xzf R13.tar.gz
tar -xzf SLIBS.tar.gz
# make sure the script will use your R installation
export PATH=$(pwd)/R/bin:$PATH
export LD_LIBRARY_PATH=$(pwd)/SS:$LD_LIBRARY_PATH
# run R, with the name of your R script
R CMD BATCH '--args argname='$1' nsnp='$2' s1='$3' nboot_per_job='$4' run_num='$5'' Rscript/boot-Recla-10-22.R 'boot400_run'$5'-'$1'.Rout'
```
We need to unzip the R installation, which we've named R13.tar.gz, and SLIBS file on the remote computer where the actual computing will occur. Following that, and needed adjustments to the paths, we execute R. Note that we have, in this file, specified the variable values by numbers, like `$1` for `argname`. The values assigned to each of these is specified in the "submit" file.
### Files in `submit_files`
The HTCondor "submit" file provides instructions for controlling the interaction with the high-throughput resources. Below is the text in the submit file `boot-Recla-10-22.sub`
```{bash, eval = FALSE}
# hello-chtc.sub
# My very first HTCondor submit file
s1=650
# start scan at this SNP index value
nsnp=350
# length of scan in number of SNPs
# which run in the experimental design?
run_num=561
#
nboot_per_job=1
# for almost all jobs), the desired name of the HTCondor log file,
# and the desired name of the standard error file.
# Wherever you see $(Cluster), HTCondor will insert the queue number
# assigned to this set of jobs at the time of submission.
universe = vanilla
log = boot-$(Process)-run$(run_num).log
error = boot-$(Process)-run$(run_num).err
#
# Specify your executable (single binary or a script that runs several
# commands), arguments, and a files for HTCondor to store standard
# output (or "screen output").
# $(Process) will be a integer number for each job, starting with "0"
# and increasing for the relevant number of jobs.
executable = ../shell_scripts/boot-Recla-10-22.sh
arguments = $(Process) $(nsnp) $(s1) $(nboot_per_job) $(run_num)
output = boot-$(Process)-run$(run_num).out
#
# Specify that HTCondor should transfer files to and from the
# computer where each job runs. The last of these lines *would* be
when_to_transfer_output = ON_EXIT
transfer_input_files = ../data,../Rscript,../shell_scripts,http://proxy.chtc.wisc.edu/SQUID/fjboehm/R13.tar.gz,http://proxy.chtc.wisc.edu/SQUID/SLIBS.tar.gz
#
# Tell HTCondor what amount of compute resources
# each job will need on the computer where it runs.
request_cpus = 1
request_memory = 3GB
request_disk = 1GB
#
requirements = (OpSysMajorVer == 6) || (OpSysMajorVer == 7)
# which computer grids to use:
+WantFlocking = true
+WantGlideIn = true
# Tell HTCondor to run instances of our job:
queue 1000
```
The ordering of the arguments on in the line that starts with `arguments = ` dictates the numbers that are assigned to each argument. These numbers are used in the shell script, above.
### Files on `SQUID`
The UW-Madison CHTC provides for each user disk space on a web proxy. More information about `SQUID` is available online.
In practice, we placed files containing founder allele dosages (for chromosomes of interest) and our (compressed) R installation on `SQUID`.
## Submitting jobs to Condor (at UW-Madison CHTC)
We followed the usual procedure for submitting R jobs with the CHTC. To submit the file `boot-Recla-10-22.sub`, we first changed into its directory. We then typed
```{bash, eval = FALSE}
condor_submit boot-Recla-10-22.sub
```
By using this command from within the `submit_files` subdirectory, we ensure that our outputted files are returned to the `submit_files` subdirectory. We then manually moved the outputted results files to the `results` subdirectory.
We found that up to ten percent of jobs would initially fail and require re-submission. There are a variety of reasons that may explain the need for re-submission. Because of this need, we include in the git repository the file `boot-Recla-10-22-fix.sub`, which gives the code that we used for resubmission. The file `bad-jobs-run561` gives the ids of the jobs that failed the initial runs.
## Session info
```{r}
devtools::session_info()
```