RNA-VC is a pipeline that can analyse any RNA sequencing (RNA-seq) raw data
available in the Gene Expression Omnibus (GEO) and the Sequence Read
Archive (SRA), yielding both variant calling data and gene/transcript
expression data. It can also be run without the variant calling, if you are
only interested in expression data. It is written in a combination of Bash
and Python, all wrapped up in the Snakemake workflow management system.
RNA-VC differentiates itself from the other available RNA-seq variant calling pipelines in that you do not need to have the raw data downloaded before you start; RNA-VC takes care of that for you. All you need is a metadata file with the specified samples you want to analyse.
RNA-VC is something I created for my own use in order to automate analyses of publicly available RNA-seq data. I share it here on GitHub on the off-chance that some other researcher would have a use for it, but I don't guarantee that it'll work for you. That being said, if you want to use RNA-VC and are having trouble getting it to run properly, I'd be happy to help!
There are a number of bioinformatic software packages that need to be installed in order to run the pipeline:
- Python 3 and Snakemake for running the pipeline
- SRAtools for downloading raw FASTQ files
- Salmon for estimating expression levels
- STAR for aligning the reads
- SAMtools for working with the resulting alignment files
- Picard for marking duplicate reads
- GATK for performing variant calling
- SnpEff for annotating the resulting variants
You must have all of these installed to run the pipeline (if you are using a
computer cluster they might already be installed, if you're lucky). After you
have made sure they are all installed, you can clone this repository to get
all the relevant files:
git clone https://github.com/fasterius/RNA-VC
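
Before running anything, it can be handy to check which of the required tools are visible on your PATH. The following is a minimal sketch and not part of RNA-VC itself: the executable names are assumptions and may differ on your system, and tools distributed as jar files (often the case for Picard and SnpEff) may not be on the PATH at all, in which case their locations are given in the configuration file instead.

```python
import shutil

# Assumed executable names (not taken from RNA-VC); adjust to your installation.
TOOLS = {
    "Python 3": "python3",
    "Snakemake": "snakemake",
    "SRAtools": "fastq-dump",
    "Salmon": "salmon",
    "STAR": "STAR",
    "SAMtools": "samtools",
    "Picard": "picard",
    "GATK": "gatk",
    "SnpEff": "snpEff",
}


def find_missing(tools=TOOLS):
    """Return the names of tools whose executable is not found on the PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

A tool reported as missing here is not necessarily absent; it may simply be installed under a different name or loaded through a cluster module system.
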
You also need to provide RNA-VC with metadata describing the GEO samples you want to analyse. At the very least, this means SRR IDs, their corresponding GSE IDs, and their read layouts (listed as either "SINGLE" or "PAIRED"). You may optionally add a column for sample groups; samples in the same group will be merged at the variant calling stage. This is useful when you have sample replicates that you want to treat as a group, which is common for variant calling procedures. Such a metadata file might look something like this:
GSE | SRR | Layout | Group |
---|---|---|---|
GSE81194 | SRR3479755 | PAIRED | Sample_1 |
GSE81194 | SRR3479758 | PAIRED | Sample_2 |
GSE81194 | SRR3479759 | PAIRED | Sample_2 |
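
A quick sanity check of the metadata before starting a long run can save time. The sketch below is not part of RNA-VC; it assumes a tab-separated file with a header row and the column names GSE, SRR, Layout and (optionally) Group, as in the example above. Your own file may use a different delimiter or column names, so adjust accordingly.

```python
import csv
import io

VALID_LAYOUTS = {"SINGLE", "PAIRED"}
REQUIRED_COLUMNS = ("GSE", "SRR", "Layout")  # "Group" is optional


def check_metadata(handle, delimiter="\t"):
    """Return a list of problems found in a metadata table (empty if none)."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    # Data rows start on line 2, after the header (assuming no blank lines).
    for line_no, row in enumerate(reader, start=2):
        gse, srr, layout = (row[c] or "" for c in REQUIRED_COLUMNS)
        if not gse.startswith("GSE"):
            problems.append(f"line {line_no}: unexpected GSE ID {gse!r}")
        if not srr.startswith("SRR"):
            problems.append(f"line {line_no}: unexpected SRR ID {srr!r}")
        if layout not in VALID_LAYOUTS:
            problems.append(f"line {line_no}: layout must be SINGLE or PAIRED, got {layout!r}")
    return problems


# The example table from above, written as tab-separated text.
example = (
    "GSE\tSRR\tLayout\tGroup\n"
    "GSE81194\tSRR3479755\tPAIRED\tSample_1\n"
    "GSE81194\tSRR3479758\tPAIRED\tSample_2\n"
    "GSE81194\tSRR3479759\tPAIRED\tSample_2\n"
)
print(check_metadata(io.StringIO(example)))  # prints []
```

To check a file on disk, pass an open file handle instead of the `io.StringIO` object.
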
The provided config.yaml file can then be edited to match the structure of
your metadata file and to set the locations of the references, indexes and
software paths required. I have provided an example of what this config file
may look like, but you need to edit the paths to correspond to your directory
structure before you can use it. Finally, you run the pipeline using snakemake:
snakemake --config LAYOUT=PAIRED
If you have mixed read layouts you have to run the pipeline twice: once for
each read layout. Alternatively, if you are running the pipeline on a cluster
(RNA-VC currently only supports SLURM) you can use the submit_snakemake.sh
wrapper, which will automatically loop through both read layouts:
bash submit_snakemake.sh
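
If you are not on a cluster and have mixed read layouts, the two separate runs can also be scripted. This is a minimal sketch rather than anything shipped with RNA-VC: the helper names `layout_commands` and `run_all` are hypothetical, and the only command used is the `snakemake --config LAYOUT=...` call shown above.

```python
import subprocess


def layout_commands(layouts):
    """Build one snakemake command per distinct read layout."""
    return [
        ["snakemake", "--config", f"LAYOUT={layout}"]
        for layout in sorted(set(layouts))
    ]


def run_all(layouts, dry_run=True):
    """Run (or, with dry_run=True, only print) the command for each layout."""
    for cmd in layout_commands(layouts):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing run.
            subprocess.run(cmd, check=True)


# Layouts as they might appear in the Layout column of a metadata file.
run_all(["PAIRED", "SINGLE", "PAIRED"])
```

With `dry_run=True` the commands are only printed, which makes it easy to confirm what would be executed before starting the actual runs.
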
If you do run it on a cluster, I have also provided an example of a
cluster.yaml configuration file, which you will also need to edit. I have
provided the example for the SLURM system, as that is what I am using; if you
want to use some other cluster configuration, please see the Snakemake website
for more information on how to do it.
The pipeline is available under the MIT licence. It is free software: you may
redistribute it and/or modify it under the terms of that licence. For more
information, please see the LICENCE file.