An Automated RNASeq Analysis Pipeline (Differential expression to gene enrichment)

arseq: An Automated RNASeq Analysis Pipeline

arseq is an easy-to-use R package for automated basic RNASeq analysis that requires minimal coding. It is designed for biologists with little to no coding experience.

For an in-depth tutorial, check out the following blog post.

The package currently supports the following analyses:

Overall data structure analysis:

  • Euclidean distance between samples
  • Poisson distance between samples
  • PCA analysis
  • PCA Eigenvectors
  • Multidimensional scaling (MDS) analysis
  • Most variable genes
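
To give a feel for what these overall analyses compute, here is a minimal base R sketch on a made-up toy matrix. It is only an illustration, not arseq's implementation: the log2 transform is an assumption (a variance-stabilising transform is a common alternative), and the Poisson distance is left out because it needs an additional package.

```r
# Toy matrix: 6 genes (rows) x 4 samples (columns); the values are made up.
set.seed(1)
toy <- matrix(rpois(24, lambda = 50), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4)))

# Log-transform so a few highly expressed genes do not dominate
# (an assumption for this sketch, not necessarily what arseq does).
log_toy <- log2(toy + 1)

# Euclidean distance between samples: dist() compares rows, so transpose first.
sample_dist <- dist(t(log_toy))

# PCA with samples as observations; the rotation matrix holds the
# eigenvectors (one loading per gene for each principal component).
pca <- prcomp(t(log_toy))
eigenvectors <- pca$rotation

# Classical multidimensional scaling of the same distances into 2 dimensions.
mds <- cmdscale(sample_dist, k = 2)
```

Each of `sample_dist`, `pca$x` and `mds` has one entry per sample, which is why these results are typically plotted to check whether samples from the same group sit close together.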

Analysis between groups of interest:

  • Differential gene expression analysis using DESeq2
  • Volcano Plot of differentially expressed genes
  • Euclidean distance between samples
  • Poisson distance between samples
  • PCA analysis
  • PCA Eigenvectors
  • Multidimensional scaling (MDS) analysis
  • GO enrichment of the differentially expressed genes
  • KEGG pathway enrichment of the differentially expressed genes
  • KEGG pathway diagrams of the top 5 enriched pathways
  • GSEA analysis (H, C1, C2, C3, C4, C5, C6, C7 genesets)

Requirements

Counts Table

A CSV file of un-normalized counts with unique genes as rows and samples as columns. A counts table is generally generated after your FASTQ files have been aligned against the reference genome and quantified (these steps are not included in this pipeline). Please note that you must provide un-normalized data as input; normalized data will not work with this package. Instead of gene names, you can also use Ensembl IDs as row identifiers. No other type of ID is supported at the moment.

Example counts table:
[Image: example counts table]
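
Before running the pipeline, it can help to confirm that a table really holds raw counts. The sketch below uses a hypothetical helper, `check_counts` (not part of arseq), that flags the most common signs of normalized or malformed input. Duplicate gene names are not checked here because `read.csv(..., row.names = 1)` already stops with an error when row names are duplicated.

```r
# Hypothetical helper (not part of arseq): basic sanity checks on a counts table.
# Returns a character vector of problems; an empty vector means none were found.
check_counts <- function(counts) {
  values <- as.matrix(counts)
  if (!is.numeric(values)) {
    return("non-numeric values present")
  }
  problems <- character(0)
  if (anyNA(values)) {
    problems <- c(problems, "missing values present")
  }
  if (any(values < 0, na.rm = TRUE)) {
    problems <- c(problems, "negative values present")
  }
  if (any(values != round(values), na.rm = TRUE)) {
    problems <- c(problems, "non-integer values (data may be normalized)")
  }
  problems
}

# Made-up example: raw counts pass, a rescaled copy is flagged.
raw <- data.frame(s1 = c(10, 0, 35), s2 = c(12, 3, 40),
                  row.names = c("TP53", "BRCA1", "MYC"))
normalized <- raw / 7.5

check_counts(raw)
check_counts(normalized)
```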

Meta data

A CSV file with information about the samples. The columns of the count matrix and the rows of the meta data must be in the same order. arseq will not guess which column of the count matrix belongs to which row of the meta data; these must be provided to arseq in a consistent order.

Example meta data file:
[Image: example meta data file]
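
Because the ordering requirement is easy to get wrong, one option is to reorder the meta data by sample name before calling arseq. The sketch below assumes that the row names of the meta data are the sample names used as column names in the counts table; `align_meta` is a hypothetical helper, not an arseq function.

```r
# Hypothetical helper (not part of arseq): reorder meta data rows so that
# they follow the column order of the counts table.
align_meta <- function(counts, meta) {
  samples <- colnames(counts)
  missing <- setdiff(samples, rownames(meta))
  if (length(missing) > 0) {
    stop("Samples missing from meta data: ", paste(missing, collapse = ", "))
  }
  meta[samples, , drop = FALSE]
}

# Made-up example in which the meta data rows are in a different order.
counts <- data.frame(s1 = c(10, 0), s2 = c(12, 3), s3 = c(7, 1),
                     row.names = c("geneA", "geneB"))
meta <- data.frame(treatment = c("drug_A", "control", "control"),
                   row.names = c("s3", "s1", "s2"))

meta_aligned <- align_meta(counts, meta)

# Confirm the order now matches before handing both objects to arseq.
stopifnot(identical(rownames(meta_aligned), colnames(counts)))
```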

How to use

Install and load the package.

# For the development version
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("ajitjohnson/arseq", INSTALL_opts = "--no-multiarch")

# For the stable version
install.packages("arseq")

# Load the package
library("arseq")

Import your counts table and meta data file into the R environment.

# Set the working directory (path to the folder where your data is located).
# Use forward slashes; single backslashes are escape characters in R strings.
setwd("path/to/the/folder/where/your/data/is/located")

# Load your counts table into R
my_data <- read.csv("counts_table.csv", row.names = 1, header = TRUE) # replace counts_table.csv with your file name

# Load your meta data into R
my_meta <- read.csv("meta_data.csv", row.names = 1, header = TRUE) # replace meta_data.csv with your file name

Run the analysis

# Run the analysis. The results will be saved in the working directory (here, the folder containing your input data).
arseq(data = my_data, meta = my_meta, design = "treatment", contrast = list(A = c("control"), B = c("drug_A")))

In the above command:

design takes the name of the column in the meta data file that describes the groups you would like to compare. You can pass more complex designs; see the DESeq2 documentation. As an example, in the meta data file shown above, there is a column named treatment that records which samples are controls and which samples were treated with different drugs. So, to identify the differentially expressed genes between the control samples and the treated samples, you would pass design = "treatment".

contrast is the other argument you need to specify. It defines the groups of samples between which you would like to perform differential expression analysis, in the format contrast = list(A = c(" "), B = c(" ")).

Suppose you have three groups in your dataset: Control, drug_A and drug_B.

Comparison 1: To identify the differentially expressed genes between Control and drug_A, pass contrast = list(A = c("Control"), B = c("drug_A")).

Comparison 2: To identify the differentially expressed genes between Control and the combined drug_A + drug_B samples, pass contrast = list(A = c("Control"), B = c("drug_A", "drug_B")).
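
To check which samples a given contrast would cover before starting a run, the two comparisons above can be previewed in base R. This is only a sketch: `contrast_samples` is a hypothetical helper, the meta data is made up, and how arseq combines the listed groups internally is not shown here.

```r
# Hypothetical helper (not arseq internals): for each side of a contrast,
# list the samples whose value in the design column is one of the named groups.
contrast_samples <- function(meta, design, contrast) {
  lapply(contrast, function(groups) rownames(meta)[meta[[design]] %in% groups])
}

# Made-up meta data with the three groups from the text.
meta <- data.frame(
  treatment = c("Control", "Control", "drug_A", "drug_A", "drug_B", "drug_B"),
  row.names = paste0("s", 1:6)
)

# Comparison 1: Control vs drug_A
comparison_1 <- contrast_samples(
  meta, "treatment", list(A = c("Control"), B = c("drug_A"))
)

# Comparison 2: Control vs drug_A + drug_B
comparison_2 <- contrast_samples(
  meta, "treatment", list(A = c("Control"), B = c("drug_A", "drug_B"))
)
```

An empty group in the result usually points to a spelling or capitalisation mismatch between the contrast and the meta data (for example "control" versus "Control").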

Example dataset

The package comes with an example dataset. To familiarise yourself with the package and its requirements, you can experiment with the example dataset.

# view the example counts table
head(example_data)

# view the example meta data
head(example_meta)

# Set the working directory (the folder in which you would like to save your results).
# Use forward slashes; single backslashes are escape characters in R strings.
setwd("path/to/the/folder/where/you/would/like/to/save/the/results")

# Run the analysis. Here we are identifying the differences between drug_A samples and drug_B samples.
arseq(data = example_data, meta = example_meta, design = "treatment", contrast = list(A = c("drug_A"), B = c("drug_B")))

Additional parameters

The arseq function accepts a few additional arguments.

general.stats: Default is TRUE. This runs the general statistics module (e.g. PCA and MDS on your entire dataset). If you are making multiple comparisons using the contrast argument, set general.stats = TRUE for the first run and general.stats = FALSE for subsequent comparisons to speed up the analysis.
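
As a rough sketch of that pattern, the comparisons can be kept in a list and general.stats switched on only for the first one. The group names reuse the Control, drug_A and drug_B example from earlier, and `my_data` and `my_meta` are assumed to have been loaded as shown above. The arseq call is guarded so that the snippet does nothing when the package or the data are not available.

```r
# Comparisons to run; the group names are the illustrative ones used earlier.
comparisons <- list(
  list(A = c("Control"), B = c("drug_A")),
  list(A = c("Control"), B = c("drug_B"))
)

# Run the general statistics module only for the first comparison.
run_general <- seq_along(comparisons) == 1

# Guarded so the sketch is harmless without arseq or without loaded data.
if (requireNamespace("arseq", quietly = TRUE) &&
    exists("my_data") && exists("my_meta")) {
  for (i in seq_along(comparisons)) {
    arseq::arseq(data = my_data, meta = my_meta, design = "treatment",
                 contrast = comparisons[[i]], general.stats = run_general[i])
  }
}
```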

variable.genes: Number of most variable genes to identify. By default, the program identifies the top 1000 most variable genes. You could set variable.genes = 3000 to identify the top 3000 most variable genes.
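
For intuition, "most variable genes" can be read as ranking genes by their variance across samples and keeping the top n. The sketch below does this in base R on log2-transformed toy counts. The helper name `top_variable_genes` and the choice of transform are assumptions made for illustration; arseq's own calculation may differ.

```r
# Hypothetical helper (not arseq's implementation): rank genes by their
# variance across samples and return the names of the n most variable ones.
top_variable_genes <- function(counts, n = 1000) {
  log_counts <- log2(as.matrix(counts) + 1)  # transform is an assumption
  gene_var <- apply(log_counts, 1, var)      # one variance per gene (row)
  n <- min(n, length(gene_var))              # cannot return more genes than exist
  names(sort(gene_var, decreasing = TRUE))[seq_len(n)]
}

# Made-up counts: one gene is flat, one varies strongly, one varies slightly.
toy <- data.frame(
  s1 = c(5, 100, 20), s2 = c(5, 10, 22), s3 = c(5, 400, 19),
  row.names = c("flat_gene", "variable_gene", "steady_gene")
)

top_variable_genes(toy, n = 2)
```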

Cite

If you find this package useful, please cite this page in your publication. Thank you.

Issues and Features

If you encounter any issues, please report them at https://github.com/ajitjohnson/arseq/issues

Additional information

For an in-depth tutorial, check out the following blog post.
You can also tweet me directly to request new methods for this package.
