OSG-GEM is a Pegasus workflow that utilizes Open Science Grid (OSG) resources to produce a Gene Expression Matrix (GEM) from DNA sequence files in FASTQ format. The workflow is also configured to run on Jetstream.
William L. Poehlman, Mats Rynge, Chris Branton, D. Balamurugan and Frank A. Feltus. OSG-GEM: Gene Expression Matrix Construction Using the Open Science Grid. *Bioinformatics and Biology Insights* 2016:10 133–141 doi: 10.4137/BBI.S38193.
This workflow processes paired-end or single-end FASTQ files to produce a matrix of normalized RNA molecule counts (FPKM). OSG-GEM can also download input directly from NCBI SRA for processing. An indexed reference genome along with gene model annotation files must be obtained prior to configuring
Once you have the ~/.ssh/workflow.pub file, add it to your profile as described in https://support.opensciencegrid.org/support/solutions/articles/12000027675-generate-ssh-key-pair-and-add-the-public-key-to-your-account#step-2-add-the-public-ssh-key-to-login-node
The workflow cloned from GitHub contains an example config file as well as example input files from the 21st chromosome of Gencode Release 24 of the GRCh38 build of the human reference genome. Two small FASTQ files containing
200,000 sequences from NCBI dataset SRR1825962 lie within the _Test_data_ directory of the workflow. To run the test workflow, the user must copy the _osg-gem.config.template_ file:
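A minimal sketch of that copy step, assuming the working copy is named `osg-gem.config` (the name used in the configuration instructions in this document). The `init_config` helper is a hypothetical convenience, not part of OSG-GEM; it refuses to overwrite an existing config.

```shell
# Hypothetical helper (not part of OSG-GEM): copy the config template to a
# working config inside the given workflow directory (default: current dir).
init_config() {
    workflow_dir="${1:-.}"
    template="$workflow_dir/osg-gem.config.template"
    config="$workflow_dir/osg-gem.config"

    if [ ! -f "$template" ]; then
        echo "template not found: $template" >&2
        return 1
    fi
    if [ -e "$config" ]; then
        echo "refusing to overwrite existing $config" >&2
        return 1
    fi
    cp "$template" "$config"
}
```

Calling `init_config` with no argument from the top level of the clone is equivalent to `cp osg-gem.config.template osg-gem.config`.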
From here, the user may follow our documentation to modify the software options as well as point to their own input datasets. Note that no test reference genome indices are available for STAR, because they are too large to upload to GitHub.
The user must provide indexed reference genome files as well as gene model annotation information prior to submitting the workflow. The user must select a reference prefix ($REF_PREFIX) that will be recognized by Pegasus as well as by Hisat2 or Tophat2. In addition, information about splice sites or a reference transcriptome must be provided in order to guide accurate mapping of split input files. Once the user has downloaded a reference genome fasta file and gene annotation in GTF/GFF3 format, the following commands can be used to produce the necessary input files, using GRCh38 as an example $REF_PREFIX for Gencode Release 24 of the human reference genome:
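A sketch of the Hisat2 case only, assuming `hisat2-build` and `hisat2_extract_splice_sites.py` (both shipped with Hisat2) are on `PATH`. The `build_hisat2_inputs` helper, the input file paths, and the `<prefix>.Splice_Sites.txt` output name are illustrative assumptions; use the filenames the workflow actually expects in its _reference_ directory.

```shell
# Hypothetical helper: build a Hisat2 index and a splice-site list for one
# reference prefix. Not part of OSG-GEM.
build_hisat2_inputs() {
    ref_prefix="$1"      # e.g. GRCh38
    genome_fasta="$2"    # reference genome FASTA (hypothetical path)
    annotation_gtf="$3"  # gene model annotation in GTF format (hypothetical path)

    # Index files are written as <prefix>.1.ht2, <prefix>.2.ht2, ...
    hisat2-build "$genome_fasta" "$ref_prefix" || return 1

    # The script prints a tab-delimited splice-site list to stdout.
    # The output filename below is an assumption, not a documented requirement.
    hisat2_extract_splice_sites.py "$annotation_gtf" > "$ref_prefix.Splice_Sites.txt"
}
```

For example, `build_hisat2_inputs GRCh38 GRCh38.fa gencode.v24.annotation.gtf` would leave the index files and splice-site list in the current directory, ready to be moved into place.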
Once the user has obtained necessary input files, the _osg-gem.config_ file must be appropriately modified and reference files must be placed into the _reference_ directory with appropriate filenames.
Suppose a user cloned OSG-GEM into '/stash2/user/username/GEM_test' and placed input paired-end FASTQ files for dataset 'TEST' in '/stash2/user/username/Data'. To process this dataset, along with dataset DRR046893 from NCBI SRA, using Hisat2 and StringTie with the GRCh38 build of the human reference genome, the osg-gem.config file would be modified as follows:
Pegasus provides a set of commands to monitor workflow progress. The path to the workflow files, as well as the commands to monitor the workflow, are printed to the screen when the workflow is submitted. For example:
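As a rough sketch of what monitoring looks like in practice, the standard Pegasus commands `pegasus-status` and `pegasus-analyzer` both take the run directory reported at submit time. The `workflow_report` wrapper and the run directory path are hypothetical; the exact commands printed for your submission take precedence.

```shell
# Hypothetical wrapper (not part of OSG-GEM) around two standard Pegasus
# commands. Pass the run directory printed when the workflow was submitted.
workflow_report() {
    run_dir="$1"
    if [ -z "$run_dir" ]; then
        echo "usage: workflow_report <run-directory>" >&2
        return 1
    fi
    # Summarize the current state of the workflow's jobs (-l: long listing).
    pegasus-status -l "$run_dir"
    # Report details for any jobs that have failed.
    pegasus-analyzer "$run_dir"
}
```

A running workflow can be cancelled with `pegasus-remove <run-directory>`, and `pegasus-statistics <run-directory>` summarizes runtime once the workflow has finished.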
This directory contains job wrappers for each step of the workflow. We suggest that the user become familiar with the parameters set for each software tool and decide whether any should be changed. To change software parameters, modify the commands in the files here. Note that any changes to input filenames in the commands must match the files that are catalogued in the _task-files_ directory (explained below).
This directory contains subdirectories for each job that utilizes specific files (e.g., a Python script to parse StringTie output, or the fasta_adapters.txt file for Trimmomatic).
Any files placed in these directories will be transferred to OSG compute nodes for the corresponding jobs. For example, if the user would like to use a different fasta adapters file 'NewAdapters.txt' for read trimming for the hisat2 job, they would copy this file to the _hisat2_ directory. Note that the job wrapper in the _tools_ directory must now be modified to match this filename.
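The adapter-file example can be sketched as follows, assuming the directory names `task-files` and `tools` described in this document. The `stage_task_file` helper is hypothetical; it copies the new file into place and lists the wrapper lines that still mention the old filename, which must then be edited by hand.

```shell
# Hypothetical helper (not part of OSG-GEM): stage a replacement file for one
# job and show which job wrappers still reference the old filename.
stage_task_file() {
    workflow_dir="$1"  # top level of the cloned workflow
    job="$2"           # subdirectory of task-files, e.g. hisat2
    new_file="$3"      # e.g. NewAdapters.txt
    old_name="$4"      # e.g. fasta_adapters.txt

    cp "$new_file" "$workflow_dir/task-files/$job/" || return 1

    # Each match is a wrapper line that needs the new filename.
    grep -rn -- "$old_name" "$workflow_dir/tools" \
        || echo "no references to $old_name found in tools/"
}
```

For the case above, this would be run as `stage_task_file . hisat2 NewAdapters.txt fasta_adapters.txt` from the top level of the workflow.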
Contains files that may be useful to users of this workflow. Currently holds the hisat2_extract_splice_sites.py script that comes with the Hisat2 software package. This script can be used to generate a tab-delimited list of splice sites from a GTF gene model file.
By default, the workflow is cloned with requests for at least 5 GB of memory and 30 GB of disk space on OSG compute nodes. If the user is working with an organism with a large reference genome and finds that 5 GB is insufficient, they may change:
This workflow utilizes OASIS software modules that OSG compute nodes can access. Job wrappers in this workflow load these modules to utilize specific versions of software. For example, the following software modules are loaded for all _tophat_ jobs using the 'module load' command:

```
module load tophat/2.1.1
module load samtools/1.3.1
module load bowtie/2.2.9
module load java/8u25
```

If the user would like to plug in alternate software, or would like to use a different version of the available software, an OSG Connect user support ticket may be submitted to have their software of choice installed as an OASIS module.
We have also found that precompiled software packages for the Linux x86_64 architecture have been stable on OSG compute nodes. The user may utilize these software packages by adding a tar archive of the package to the appropriate task-files directory of the workflow. The archive is then transferred as input to the job, where it can be unpacked and used for the user's task.
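A sketch of both halves of that approach, under stated assumptions: the package name, the helper functions `pack_tool` and `unpack_tool`, and the presence of a `bin/` subdirectory inside the package are all hypothetical. The unpack step would need to be added to the relevant job wrapper by the user.

```shell
# Submit side (hypothetical helper): pack a precompiled tool directory and
# place the archive in the task-files directory of the job that needs it.
pack_tool() {
    tool_dir="$1"   # e.g. /path/to/mytool-1.0 (assumed to contain bin/)
    task_dir="$2"   # e.g. task-files/hisat2
    name=$(basename "$tool_dir")
    # -C makes paths inside the archive start at the tool directory name.
    tar -czf "$task_dir/$name.tar.gz" -C "$(dirname "$tool_dir")" "$name"
}

# Job side (hypothetical helper): unpack the transferred archive in the job's
# working directory and put its bin/ directory first on PATH.
unpack_tool() {
    archive="$1"    # e.g. mytool-1.0.tar.gz
    tar -xzf "$archive" || return 1
    PATH="$PWD/$(basename "$archive" .tar.gz)/bin:$PATH"
    export PATH
}
```

Prepending to `PATH` means the unpacked binaries take precedence over any module-provided version with the same name, which is usually the intent when swapping in a different release.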