This repository contains an example differential expression analysis: bcbio aligns the fastq files, and the limma package in R models differential expression between tissues. It also contains code comparing our data to data from a few other sources.
- comparison_data: TPM matrices for comparing our data to other datasets in comparison.Rmd
- bcbio_run.sh: Script for preparing and launching bcbio
- bcbio_slurm.sh: Script for running bcbio, intended to be submitted to a cluster (bcbio_run.sh does this)
- comparison.Rmd: R Markdown document for comparing our data to other datasets
- download_fastqs.sh: Script for downloading fastq files using the SRA toolkit
- illumina-rnaseq.yaml: Template description of bcbio pipeline. Used by bcbio_run.sh to set up for bcbio
- immune_specific_genes.txt: a list of immune specific genes used in tissue_specificity_analysis.Rmd
- sra_data.csv: a CSV file with SRA numbers in the first column and the corresponding sample names in the second column
- tissue_specificity_analysis.Rmd: R Markdown document detailing the tissue specificity analysis
The provided bits of code assume you have the following software installed:
- bcbio is a tool for running various pre-processing pipelines for sequencing data.
- The code was originally run on version 1.1.5
- R and RMarkdown. We recommend using both with RStudio.
- R is a programming language, and RMarkdown is a package for producing documents with embedded R scripts. RStudio is an integrated development environment (IDE) for R.
- The code was originally run on R version 3.5.2 and RStudio version 1.1.456
- Additional R packages:
- SRA toolkit
- The NCBI SRA toolkit is a suite of programs used for accessing files from the SRA database
- The code was originally run on version 2.9.3
When running bcbio, the code also assumes that you are working on a Linux-based computing cluster running a scheduler. As written, it is set up for the SLURM scheduler, but it should work with other schedulers with minor modifications to bcbio_run.sh and bcbio_slurm.sh. Alternatively, by modifying bcbio_run.sh (and bypassing bcbio_slurm.sh), one could run bcbio locally.
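Before starting, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and the executable names in the usage line are assumptions about a typical install.

```shell
# Hypothetical helper (not part of this repository):
# print the name of every command in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

For example, `missing_tools bcbio_nextgen.py Rscript fastq-dump sbatch` prints nothing if all four are available, and otherwise prints one missing name per line. Drop `sbatch` from the list if you plan to run bcbio locally.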
We use bcbio to align our fastq files to a reference genome and then produce a table of counts for each gene/sample.
In the main directory, run
./download_fastqs.sh. This runs a few commands to make a fastqs directory and download fastq files into it. This script depends on the SRA toolkit. The download took us 2 hours, but the time probably depends heavily on your internet connection. The compressed fastq files are about 14GB.
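To preview which accessions would be fetched, something like the following may be useful. It is a sketch under stated assumptions, not a copy of download_fastqs.sh: `print_download_commands` is a hypothetical helper, sra_data.csv is assumed to have no header row, and the `fastq-dump` options shown (`--gzip`, `--outdir`) may differ from what the script actually uses.

```shell
# Hypothetical helper (not part of this repository):
# print one fastq-dump command per row of a two-column CSV (accession,sample name).
# Assumes the CSV has no header row; the real download_fastqs.sh may differ.
print_download_commands() {
  csv="$1"
  outdir="${2:-fastqs}"   # output directory, defaulting to "fastqs"
  # Strip carriage returns so files saved with Windows line endings also parse.
  tr -d '\r' < "$csv" |
  while IFS=, read -r accession sample || [ -n "$accession" ]; do
    [ -n "$accession" ] || continue   # skip blank lines
    # The sample name is emitted as a trailing comment for readability only.
    printf 'fastq-dump --gzip --outdir %s %s  # %s\n' "$outdir" "$accession" "$sample"
  done
}
```

Running `print_download_commands sra_data.csv` only prints the commands; nothing is downloaded until you run them yourself.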
While logged into your computing cluster of choice, in the main directory run
./bcbio_run.sh. This will run a few commands to set up the directory for aligning with bcbio, and then submit the job described in bcbio_slurm.sh to the cluster. It took us about 4 hours, but the time will depend on the parameters of your computing cluster. The files bcbio produces take up about 160GB, but the large majority are temporary and can be removed after the run has completed.
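Because the run needs roughly 160GB on top of the fastq files, a quick free-space check before submitting may save a failed job. This is a minimal sketch: `has_free_gb` is a hypothetical helper, not something bcbio or this repository provides.

```shell
# Hypothetical helper (not part of this repository):
# succeed if the filesystem holding directory $1 has at least $2 GB available.
has_free_gb() {
  dir="$1"
  required_gb="$2"
  # df -P gives stable POSIX columns; -k reports 1024-byte blocks;
  # field 4 of the second line is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $((required_gb * 1024 * 1024)) ]
}
```

For example, `has_free_gb . 160 && ./bcbio_run.sh` launches the run only if the current filesystem reports at least 160GB available.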
2: Tissue Specificity Analysis
We use R to analyze the counts table produced in (1) to look for genes which are upregulated in particular tissues.
Open tissue_specificity_analysis.Rmd in RStudio. To run it all at once, knit it (there should be a "knit" button on the upper bar). Alternatively, you can step through the commands one by one.
This takes about a minute.
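If you prefer the command line to RStudio, the document can probably also be knit with `rmarkdown::render`. The sketch below assumes `Rscript` is on your PATH and that the rmarkdown package and pandoc are installed (RStudio bundles pandoc; a bare R install may not). `knit_rmd` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper (not part of this repository):
# knit an R Markdown file from the shell.
knit_rmd() {
  rmd="$1"
  if [ ! -f "$rmd" ]; then
    echo "knit_rmd: no such file: $rmd" >&2
    return 1
  fi
  # The file name is passed as a trailing argument rather than pasted into
  # the R expression, so names containing quotes do not break the call.
  Rscript -e 'rmarkdown::render(commandArgs(trailingOnly = TRUE)[1])' "$rmd"
}
```

For example, `knit_rmd tissue_specificity_analysis.Rmd` should write the rendered output next to the source file. The same helper applies to comparison.Rmd.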
3: Dataset Comparison
We use R to compare our data to data from a few other sources and assess its validity.
Open comparison.Rmd in RStudio. To run it all at once, knit it. Alternatively, you can step through the commands one by one.
This takes 20-30 minutes.