Skip to content
Branch: master
Find file History
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
..
Failed to load latest commit information.
GSE107655_annotation.csv
GSE107655_config.yaml
README.md

README.md

Complete RNA-Seq Project

This example will show you how to use looper to complete an RNA-Seq project -- from raw data download to differential expression analysis -- using tools that are designed for the PEP format: geofetch, rnapipe, BiocProject, and DESeqPackager.

What is PEP? PEP format is a way to organize metadata, which consists of sample file paths and attributes, such as organism, source, and extraction protocol. PEP format requires only a yaml (for project/pipeline configuration) and csv (for sample annotation), is compatible with a variety of PEP tools.

An overview of this tutorial:

  1. Download RNA-Seq data from the Gene Expression Omnibus using geofetch, which will automatically produce a PEP project_config yaml and sample_annotation csv from the data
  2. Run an RNA-Seq processing pipeline (rnaKallisto from the rnapipe repository) using looper
  3. Format the results of rnaKallisto for differential expression analysis using DESeqPackager

0. Environment Setup and Required Software

First, the computing environment has to be set up with the correct environment variables for easier use of the tools (if not, then the filepaths have to be manually specified in command line arguments). Environment variables also make it easier to change the yaml, because changing an environment variable will reflect everywhere it is used.

If you are in UVA Rivanna and you have the rivanna4 module loaded, then that's all! Otherwise, in .bash_profile or .profile, add a few environment variables. It might look something like this:

export DATA=/path/to/sradata
export SRARAW=/path/to/sradata/sra/
export SRAMETA=/path/to/sradata/sra_meta/
export SRABAM=/path/to/sradata/sra_bam/
export PROCESSED=/path/to/processed/

Now install the required software:

pip install --user geofetch

You may want to adjust where geofetch downloads data.

pip install --user https://github.com/pepkit/looper/zipball/master
git clone https://github.com/databio/rnapipe.git
devtools::install_github("pepkit/BiocProject")
git clone https://github.com/databio/DESeqPackager.git

1. Downloading raw data from GEO

We'll be using GSE107655 from the Gene Expression Omnibus (GEO). To download the data, use the tool geofetch:

geofetch -i GSE107655 --pipeline_interfaces path/to/rnapipe/pipeline_interface.yaml

Geofetch downloads SRA data into $SRARAW and also create a PEP config file (GSE107655_config.yaml) and sample_annotation csv (GSE107655_annotation) in the $SRAMETA/GSE107655 folder.

The first step involving looper will be to convert the SRA files into BAM files. geofetch has already set up the yaml file properly.

cd $SRAMETA/GSE107655
looper run GSE107655_config.yaml --sp sra_convert --lump 10

2. Running the RNA-Seq pipeline using looper

Pipeline source code

If you already have all of the tools required by rnaKallisto, then only one command is needed!

cd $SRAMETA/GSE107655
looper run GSE107655_config.yaml --lump 5 --single-end-defaults

After it is finished running, the RNA-Seq results will be in $PROCESSED/GSE107655.

3. Output a countTable for differential expression using DESeqPackager

DESeqPackager uses the PEP project format in R to produce a countDataSet needed for DESeq analysis. More info can be found in the DESeqPackager repository.

After running rnaKallisto, the PEP project does not know the paths to the $PROCESSED files. The solution to this is to add another derived attribute, here called result_source, along with another specification in data_sources, here called src. In addition, in the sample_annotation csv, a column with header result_source and values of src will need to be added. This combination of a derived attribute and data source will essentially fill in the src values in the sample_annotation csv with the correct path to the $PROCESSED files.

derived_attributes: [data_source, result_source]

data_sources:
  SRA: "${SRABAM}{SRR}.bam"
  src: "${PROCESSED}/GSE107655/results_pipeline/{sample_name}/kallisto/abundance.tsv"

Now DESeqPackager will know where to find the abundance.tsv files needed for DESeq. In the future, a functionality may be added to looper to automatically output the location of the processed files.

We can use BiocProject to interface with DESeqPackager by adding a section called bioconductor in the yaml. This will tell BiocProject what function we are using, and also what arguments that function takes.

bioconductor:
  readFunName: DESeqPackager
  readFunPath: DESeqPackager.R
  funcArgs:
    data_source: result_source
    gene_names: target_id
    gene_counts: est_counts

The final GSE107655_config.yaml file should look like the one in this repository.

Now, in R, BiocProject uses DESeqPackager and pepr behind the scenes to output a countDataSet!

source("/path/to/DESeqPackager.R")
bpArgs = BiocProject(file="GSE107655_config.yaml")
getData(bpArgs)
# do DESeq analysis with the produced countDataSet
You can’t perform that action at this time.