# OSG-GEM
OSG-GEM is a Pegasus workflow that uses Open Science Grid (OSG) resources to produce a Gene Expression Matrix (GEM) from DNA sequence files in FASTQ format. The workflow is also configured to run on Jetstream.

### Citation

William L. Poehlman, Mats Rynge, Chris Branton, D. Balamurugan and Frank A. Feltus. OSG-GEM: Gene Expression Matrix Construction Using the Open Science Grid. *Bioinformatics and Biology Insights* 2016:10 133–141 doi: 10.4137/BBI.S38193.

## Introduction

This workflow processes paired-end or single-end FASTQ files to produce a matrix of normalized RNA molecule counts (FPKM). OSG-GEM also supports direct input downloads from NCBI SRA for processing. An indexed reference genome and gene model annotation files must be obtained before configuring and running the workflow.
The following tasks are directed by the Pegasus workflow manager:

* Splitting input FASTQ files into files containing 20,000 sequences each
* Trimming raw sequences with Trimmomatic
* Aligning reads to the reference genome using Hisat2 or Tophat2
* Merging BAM alignment files into a single sorted BAM file using Samtools
* Quantifying RNA transcript levels using StringTie or Cufflinks

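The splitting step in the list above can be pictured with a short Python sketch. This is an illustration only, not the splitter that OSG-GEM ships: the function name `split_fastq`, the chunk naming scheme, and the assumption of uncompressed FASTQ with exactly four lines per record are all hypothetical.

```python
import itertools
from pathlib import Path


def split_fastq(path, out_dir, records_per_chunk=20000):
    """Split a four-line-per-record FASTQ file into numbered chunk files.

    Hypothetical helper: the chunk naming scheme is an assumption, not
    the naming used by OSG-GEM.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines_per_chunk = 4 * records_per_chunk
    chunks = []
    with open(path) as handle:
        for index in itertools.count(1):
            # Pull the next block of lines; an empty block means end of file.
            block = list(itertools.islice(handle, lines_per_chunk))
            if not block:
                break
            chunk = out_dir / f"{Path(path).stem}.{index:04d}.fastq"
            chunk.write_text("".join(block))
            chunks.append(chunk)
    return chunks
```

Each chunk can then be trimmed and aligned as an independent job, which is what makes the later merge into a single sorted BAM file necessary.
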
Users should become familiar with the documentation for the following software packages:

* [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)
* [Hisat2](https://ccb.jhu.edu/software/hisat2/manual.shtml)
* [Tophat2](https://ccb.jhu.edu/software/tophat/manual.shtml)
* [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)
* [Samtools](http://www.htslib.org/doc/samtools.html)
* [StringTie](https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual)
* [Cufflinks](http://cole-trapnell-lab.github.io/cufflinks/manual/)

## Open Science Grid / Execution Environment

The OSG-GEM workflow is designed to execute on the [Open Science Grid](https://www.opensciencegrid.org/) via the
[OSG Connect](https://osgconnect.net/) infrastructure. Access to the system can be requested on the
[sign up page](https://osgconnect.net/signup).

Once you have an account and have joined a project, log in to the login02 submit node.
This node can be accessed by ssh:

    ssh username@login02.osgconnect.net

A workflow-specific ssh key has to be created. This key is used for some of the data staging steps of the workflow.

    $ mkdir -p ~/.ssh
    $ ssh-keygen -t rsa -b 2048 -f ~/.ssh/workflow
       (just hit enter when asked for a passphrase)

Once you have the ~/.ssh/workflow.pub file, add it to your profile as described in https://support.opensciencegrid.org/support/solutions/articles/12000027675-generate-ssh-key-pair-and-add-the-public-key-to-your-account#step-2-add-the-public-ssh-key-to-login-node

## Example Workflow Setup

The workflow cloned from GitHub contains an example config file as well as example input files from the 21st chromosome of Gencode Release 24 of the GRCh38 build of the human reference genome. Two small FASTQ files containing 200,000 sequences from NCBI dataset SRR1825962 are in the _Test_data_ directory of the workflow. To run the test workflow, the user must copy the _osg-gem.conf.template_ file:

    $ cp osg-gem.conf.template osg-gem.conf

The workflow, configured to run Hisat2 and StringTie, can then be launched by running:

    $ ./submit

From here, the user may follow our documentation to modify the software options and to point to their own input datasets. Note that no test reference genome indices are available for STAR, because they are too large to upload to GitHub.

## Pre-Workflow User Input

The user must provide indexed reference genome files and gene model annotation information before submitting the workflow. The user must select a reference prefix ($REF_PREFIX) that will be recognized by Pegasus as well as by Hisat2 or Tophat2. In addition, information about splice sites or a reference transcriptome must be provided to guide accurate mapping of split input files. Once the user has downloaded a reference genome FASTA file and gene annotation in GTF/GFF3 format, the following commands can be used to produce the necessary input files, using GRCh38 as an example $REF_PREFIX for Gencode Release 24 of the human reference genome:

### If the user would like to use Hisat2:

#### Index the reference genome

    $ hisat2-build -f GRCh38.fa GRCh38

#### Generate a tab-delimited list of splice sites using the gene model GTF file as input (requires Python's collections.defaultdict)

    $ python hisat2_extract_splice_sites.py GRCh38-gencode.v24.annotation.gtf > GRCh38.Splice_Sites.txt

### If the user would like to use Tophat2:

#### Index the reference genome

    $ bowtie2-build GRCh38.fa GRCh38

#### Generate and Index Reference Transcriptome

    $ tophat2 -G GRCh38.gencode.v24.annotation.gff3 --transcriptome-index=transcriptome_data/GRCh38 GRCh38
    $ tar czf GRCh38.transcriptome_data.tar.gz transcriptome_data/

### If the user would like to use STAR:

Change into the _star_index_ directory within _reference_ and place all genome files there:

    $ cd reference/star_index
    $ STAR-2.5.2b/bin/Linux_x86_64_static/STAR --runMode genomeGenerate --runThreadN 4 --genomeDir ./ --genomeFastaFiles ./GRCh38.fa --sjdbGTFfile ./GRCh38.gencode.v24.annotation.gff3

#### Generate a tab-delimited list of splice sites using the gene model GTF file as input (requires Python's collections.defaultdict)

    $ python hisat2_extract_splice_sites.py GRCh38-gencode.v24.annotation.gtf > GRCh38.Splice_Sites.txt

## Workflow Configuration
Once the user has obtained the necessary input files, the _osg-gem.config_ file must be modified and the reference files must be placed in the _reference_ directory under the filenames listed below.

### Place Files in _reference_ directory

#### If the user would like to use Hisat2, the following files must be present in the _reference_ directory:

    $REF_PREFIX.fa
    $REF_PREFIX.1.ht2 … $REF_PREFIX.N.ht2
    $REF_PREFIX.Splice_Sites.txt
    $REF_PREFIX.gff3

#### If the user would like to use Tophat2, the following files must be present in the _reference_ directory:

    $REF_PREFIX.fa
    $REF_PREFIX.1.bt2 … $REF_PREFIX.N.bt2
    $REF_PREFIX.rev.1.bt2
    $REF_PREFIX.rev.2.bt2
    $REF_PREFIX.transcriptome_data.tar.gz
    $REF_PREFIX.gff3

#### If the user would like to use STAR, the following files must be present in the _reference/star_index_ directory:

    chrLength.txt
    chrNameLength.txt
    chrName.txt
    chrStart.txt
    Genome
    genomeParameters.txt
    Log.out
    SA
    SAindex
    *Splice_Sites.txt

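Before submitting, the Hisat2 and Tophat2 file lists above can be checked for presence. The following is a minimal sketch, assuming those lists are complete; the helper name `missing_reference_files` is hypothetical, STAR is not covered, and the numbered index files are only checked for "at least one present" because the value of N depends on the genome.

```python
from pathlib import Path


def missing_reference_files(reference_dir, prefix, aligner):
    """Return the expected reference files that are absent.

    Hypothetical helper based on the file lists in this README.
    """
    ref = Path(reference_dir)
    if aligner == "hisat2":
        fixed = [f"{prefix}.fa", f"{prefix}.Splice_Sites.txt", f"{prefix}.gff3"]
        index_glob = f"{prefix}.[0-9]*.ht2"
    elif aligner == "tophat2":
        fixed = [f"{prefix}.fa", f"{prefix}.rev.1.bt2", f"{prefix}.rev.2.bt2",
                 f"{prefix}.transcriptome_data.tar.gz", f"{prefix}.gff3"]
        index_glob = f"{prefix}.[0-9]*.bt2"
    else:
        raise ValueError(f"unsupported aligner: {aligner}")
    missing = [name for name in fixed if not (ref / name).is_file()]
    # Report the glob pattern itself when no numbered index file exists.
    if not any(ref.glob(index_glob)):
        missing.append(index_glob)
    return missing
```

Calling `missing_reference_files("reference", "GRCh38", "hisat2")` returns an empty list when everything in the Hisat2 list is in place.
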
### User Input Datasets

OSG-GEM supports the processing of multiple input datasets into a single Gene Expression Matrix (GEM). The user
may point to paired-end or single-end FASTQ files on an OSG filesystem, or simply specify the NCBI Sequence Read Archive (SRA)
IDs that they would like to process. A blend of FASTQ files on OSG and SRA IDs may be provided. Please note, however, that a
mixture of single-end and paired-end reads cannot be used. The user must select *either* paired-end or single-end reads.

Each input line in the config file can be a pair of forward and reverse files, separated by a space:

    input1 = forward.fastq.gz reverse.fastq.gz

Or a single FASTQ file (for single-end reads):

    input1 = test.fastq.gz

Or a single SRA ID:

    input2 = DRR046893

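The three accepted forms can be told apart mechanically. This sketch is an assumption about how such a value might be classified, not the workflow's own parser: the name `classify_input` is hypothetical, and the accession pattern only covers run accessions that start with SRR, ERR or DRR.

```python
import re

# Run accessions from SRA, ENA and DDBJ start with SRR, ERR or DRR.
SRA_RUN = re.compile(r"^[SED]RR\d+$")


def classify_input(value):
    """Classify one input value as 'paired', 'single' or 'sra'."""
    fields = value.split()
    if len(fields) == 2:
        return "paired"
    if len(fields) == 1:
        return "sra" if SRA_RUN.match(fields[0]) else "single"
    raise ValueError(f"expected one or two fields, got {len(fields)}")
```

A check like this makes it easy to spot a config that mixes `paired` and `single` entries, which the workflow does not support.
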
### Modify _osg-gem.config_ file

#### Specify reference prefix that matches filenames in the _reference_ directory

    [reference]

    reference_prefix = $REF_PREFIX

#### Specify file paths to FASTQ files for a given dataset ($DATASET)

    [inputs]

    input1 = /path_to_forward_data/TEST_1.fastq.gz ./path_to_reverse_data/TEST_2.fastq.gz or SRAID or ./path_to_fastq/TEST.fastq.gz

    input2 = /path_to_forward_data/TEST2_1.fastq.gz ./path_to_reverse_data/TEST2_2.fastq.gz or SRAID or ./path_to_fastq/TEST2.fastq.gz

#### Select software and read layout options

    [config]

    single = 'True' or 'False'

    paired = 'True' or 'False'

    tophat2 = 'True' or 'False'

    hisat2 = 'True' or 'False'

    cufflinks = 'True' or 'False'

    stringtie = 'True' or 'False'

#### Example _osg-gem.config_ file:

Suppose a user cloned OSG-GEM into '/stash2/user/username/GEM_test' and placed input paired-end FASTQ files for dataset 'TEST' in '/stash2/user/username/Data'. To process this dataset, along with dataset DRR046893 from NCBI SRA, using Hisat2 and StringTie with the GRCh38 build of the human reference genome, the osg-gem.config file would be modified as follows:

    [reference]

    reference_prefix = GRCh38

    [inputs]

    input1 = ./Test_data/TEST_1.fastq.gz ./Test_data/TEST_2.fastq.gz

    input2 = DRR046893

    [config]

    single = False

    paired = True

    tophat2 = False

    hisat2 = True

    cufflinks = False

    stringtie = True

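A config like the example above can be sanity-checked before submission. The sketch below uses Python's standard `configparser`; whether OSG-GEM itself reads the file this way is an assumption, and `check_config` is a hypothetical name. The only rule it enforces is the one stated in this README: exactly one of `single` and `paired` must be selected.

```python
import configparser

EXAMPLE = """
[reference]
reference_prefix = GRCh38

[inputs]
input1 = ./Test_data/TEST_1.fastq.gz ./Test_data/TEST_2.fastq.gz
input2 = DRR046893

[config]
single = False
paired = True
tophat2 = False
hisat2 = True
cufflinks = False
stringtie = True
"""


def check_config(text):
    """Parse config text and enforce the single-or-paired rule."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    options = {key: parser.getboolean("config", key) for key in parser["config"]}
    if options["single"] == options["paired"]:
        raise ValueError("set exactly one of 'single' and 'paired' to True")
    enabled = sorted(key for key, on in options.items() if on)
    return parser["reference"]["reference_prefix"], dict(parser["inputs"]), enabled


prefix, inputs, enabled = check_config(EXAMPLE)
print(prefix, sorted(inputs), enabled)
```

Note that `getboolean` accepts unquoted `True` and `False` as written in the example; quoted values would need the quotes stripped first.
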
## Monitoring Workflow

Pegasus provides a set of commands to monitor workflow progress. When the workflow is submitted, the path to the workflow files and the commands for monitoring it are printed to the screen. For example:

    2016.05.26 23:31:03.859 CDT: Your workflow has been started and is running in
    2016.05.26 23:31:03.869 CDT: /stash2/user/username/workflows/osg-gem-x/workflow/osg-gem-x
    2016.05.26 23:31:03.880 CDT: *** To monitor the workflow you can run ***

    2016.05.26 23:31:03.891 CDT: pegasus-status -l /stash2/user/username/workflows/osg-gem-x/workflow/osg-gem-x

    2016.05.26 23:31:03.901 CDT: *** To remove your workflow run ***

    2016.05.26 23:31:03.912 CDT: pegasus-remove /stash2/user/username/workflows/osg-gem-x/workflow/osg-gem-x

Upon completion, output is transferred to the base of this directory. For example:

    $ cd /stash2/user/username/workflows/osg-gem-x
    $ ls
    data outputs scratch workflow

    $ cd outputs
    $ head -1 TEST-merged_counts.fpkm
    TranscriptID FPKM

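The merged file can be loaded for downstream analysis. This is a sketch only: it assumes one header line followed by whitespace-separated `TranscriptID FPKM` rows, which matches the header shown above but has not been checked against the rest of the file, and `read_fpkm` is a hypothetical name.

```python
def read_fpkm(path):
    """Read a two-column FPKM table into a dict of transcript -> FPKM."""
    values = {}
    with open(path) as handle:
        next(handle, None)  # skip the 'TranscriptID FPKM' header line
        for line in handle:
            fields = line.split()
            if len(fields) != 2:
                continue  # ignore blank or malformed rows
            values[fields[0]] = float(fields[1])
    return values
```
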
## User Modifications to Workflow

To customize OSG-GEM parameters, a basic understanding of the workflow's directory structure is necessary.

### Workflow Directory Structure

#### Test_data

This contains small FASTQ files for testing. The user may place their own data in this directory or elsewhere on the OSG filesystems.

#### reference

Contains all reference genome and annotation files, as described previously.

#### Tools

This directory contains job wrappers for each step of the workflow. The user should become familiar with the parameters set for each program to decide whether to change them. To change software parameters, modify the commands in the files here. Note that any changes to input filenames in the commands must match the files that are catalogued in the _task-files_ directory (explained below).

#### task-files

This directory contains a subdirectory for each job that uses specific files (e.g., a Python script to parse StringTie output, or the fasta_adapters.txt file for Trimmomatic).

Any files placed in these directories will be transferred to OSG compute nodes for the corresponding jobs. For example, if the user would like to use a different FASTA adapters file 'NewAdapters.txt' for read trimming for the hisat2 job, they would copy this file to the _hisat2_ directory. Note that the job wrapper in the _tools_ directory must then be modified to match this filename.

#### useful_files

Contains files that may be useful to users of this workflow. It currently holds the hisat2_extract_splice_sites.py script that comes with the Hisat2 software package. This script can be used to generate a tab-delimited list of splice sites from a GTF gene model file.

#### Base directory

The base directory of the workflow contains the _osg-gem.config_ file, the _submit_ script, and a Pegasus configuration file.

The execution environment is catalogued in the _submit_ script, allowing the user to alter the resources requested by the workflow.

For example:

    <site handle="condorpool" arch="x86_64" os="LINUX">
    <profile namespace="pegasus" key="style" >condor</profile>
    <profile namespace="condor" key="universe" >vanilla</profile>
    <profile namespace="condor" key="requirements" >OSGVO_OS_STRING == "RHEL 6" &amp;&amp; HAS_MODULES == True &amp;&amp; HAS_SCP == True &amp;&amp; GLIDEIN_ResourceName != "Hyak" &amp;&amp; TARGET.GLIDEIN_ResourceName =!= MY.MachineAttrGLIDEIN_ResourceName1 &amp;&amp; TARGET.GLIDEIN_ResourceName =!= MY.MachineAttrGLIDEIN_ResourceName2 &amp;&amp; TARGET.GLIDEIN_ResourceName =!= MY.MachineAttrGLIDEIN_ResourceName3 &amp;&amp; TARGET.GLIDEIN_ResourceName =!= MY.MachineAttrGLIDEIN_ResourceName4</profile>
    <profile namespace="condor" key="request_memory" >5 GB</profile>
    <profile namespace="condor" key="request_disk" >30 GB</profile>
    <profile namespace="condor" key="+WantsStashCache" >True</profile>
    </site>

By default, the workflow is cloned with requests for at least 5 GB of memory and 30 GB of disk space on OSG compute nodes. If the user is working with an organism with a large reference genome and finds that 5 GB is insufficient, they may change:

    <profile namespace="condor" key="request_memory" >5 GB</profile>

to request 6 GB of RAM:

    <profile namespace="condor" key="request_memory" >6 GB</profile>

If the user finds that 5 GB of RAM per job is unnecessary and would like to shorten queue times, they may change:

    <profile namespace="condor" key="request_memory" >5 GB</profile>

to request only 3 GB of RAM per job:

    <profile namespace="condor" key="request_memory" >3 GB</profile>

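The same edit can be scripted when several runs need different memory requests. This is a rough sketch under the assumption that the profile line appears in the _submit_ script exactly in the form shown above; `set_request_memory` is a hypothetical helper that works on the script's text and leaves reading and writing the file to the caller.

```python
import re

# Matches the condor request_memory profile in the form shown above.
PROFILE = re.compile(
    r'(<profile namespace="condor" key="request_memory"\s*>)[^<]*(</profile>)'
)


def set_request_memory(text, amount):
    """Return text with every condor request_memory profile set to amount."""
    updated, count = PROFILE.subn(lambda m: m.group(1) + amount + m.group(2), text)
    if count == 0:
        raise ValueError("no request_memory profile found")
    return updated
```

Editing the line by hand is just as valid; the function only saves retyping and fails loudly if the profile is not found.
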
### Interchanging Software

This workflow uses OASIS software modules that OSG compute nodes can access. Job wrappers in this workflow load these modules to use specific versions of software. For example, the following software modules are loaded for all _tophat_ jobs using the 'module load' command:

    module load tophat/2.1.1
    module load samtools/1.3.1
    module load bowtie/2.2.9
    module load java/8u25

If the user would like to plug in alternate software, or would like to use a different version of the available software, an OSG Connect user support ticket may be submitted to have their software of choice installed as an OASIS module.

We have also found that precompiled software packages for the Linux x86_64 architecture have been stable on OSG compute nodes. The user may use these packages by adding a tar archive of the package to the appropriate task-files directory of the workflow. The archive will then be transferred as input to the job, where it can be unpacked and used for the user's task.