Skip to content
Nuno Agostinho edited this page Aug 17, 2016 · 26 revisions

Welcome to the SDEAP wiki!

Differential expression analysis without predefined biological conditions is critical to biological and clinical studies on population data. For example, it can be used to discover biomarkers to classify cancer samples into subtypes such that better diagnosis and therapy methods can be developed for each subtype, or individual cells into cell types.

Although several methods have been introduced recently for performing such differential expression analysis at the gene level, there is no method in the current literature for performing the analysis at the transcript level (i.e., differential transcript expression or DTE analysis). This document describes the installation and usage of the first DTE analysis method for population RNA-Seq data without predefined biological conditions, called SDEAP.

Table of contents

Installation

This software was developed on Linux and it has been tested with g++ 4.4.5. To install it, perform the following steps:

  1. Install scikit-learn (http://scikit-learn.org/stable/). SDEAP has been tested with scikit-learn-0.16.1.
  2. Install R.
  3. Download SDEAP 0.1
  4. Extract the tar.gz archive: $ tar zfxv sdeap_0.*.tar.gz
  5. After extracting the tar.gz file, users will find the two subdirectories EXPR and MERGE. Please, step into each directory and execute:
  $ make clean
  $ make

Afterwards, the EXPR and MERGE binary executable files will be created in the EXPR and MERGE directories, respectively.

#Usage

To use SDEAP to conduct differential transcript expression (DTE) analysis on population RNA-Seq data, the user needs to (1) convert the input RNA-Seq samples to expression features and then (2) run predict differential expressed genes using the expression features of each individual gene.

Convert RNA-Seq samples to expression features for individual genes (EXPR)

In the SDEAP software package, a utility called EXPR takes the (known or predicted) gene models of the reference genome and the mapped RNA-Seq reads as the input and generates the splice graphs for all genes as the output. It accepts annotated gene models in GTF format and read alignments in the popular BED format. Note that the input data of SDEAP can be splice graphs provided by users directly as long as they are represented in the SG file format. Please, use EXPR_SG to extract expression features from the given splice graphs. The reads mapped to every gene should be save as an individual BED file and the BED files of each RNA-Seq sample should be kept in a separate directory. The name of the BED files for every gene should be the same in the directories. For example, the BED file of the gene ABCD3 should be ABCD3 as well. The annotation of every gene should also be saved as an individual GTF file whose name is the name of the gene. The users of EXPR also need to provide a list of the total numbers of reads in every RNA-Seq samples in millions (saved in a txt file, one line for one library as given in ms_sdeap_lib.txt for an example), a list of the read length in the input RNA-Seq samples (as in MS_len.txt), and a list of genes to be predicted (also saved in a txt file, one line for one gene as in ms_glist.txt) to EXPR. To run EXPR, a command file (e.g., CMD.txt) which specifies the paths to the input, output, and the required lists should be given. Because the extraction of expression features is time-consuming, we highly recommend users to run EXPR in parallel. The total number of computational threads is specified by proc_num in the command file. Then, SDEAP can be invoked by the following command:

$ ./EXPR <the index of threads> <a command file>

An example of this command running in 4 threads:

$ ./ EXPR  0  CMD.txt
$ ./ EXPR  1  CMD.txt
$ ./ EXPR  2  CMD.txt
$ ./ EXPR  3  CMD.txt

After running EXPR (or EXPR_SG), for every gene, there is a profile of the expression features created to summarize the expression features of the gene among the given RNA-Seq samples in the output directory which is specified in the input command file.

Predict differential expressed genes using the profiles of expression features (MERGE)

The MERGE command requires the path to the file containing the profiles of expression features (e.g., CMD_MERGE.txt). In the command, the users of SDEAP should also provide two parameters for filtering. The first parameter is the minimum of fold change rate to the background noise and the second is the maximum of background noise as described in the manuscript. MERGE can be invoked with the following command:

$ ./MERGE < a command file>

For example, $ ./ MERGE CMD_MERGE.txt

#Output

SDEAP will create a filtered_pred.txt file to summarize all differentially expressed (DE) features of the predicted DTE genes. Suppose that there are n RNA-Seq samples in the input data. The filtered_pred.txt file contains 2n + 4 columns.

The first column shows the predicted alternative splicing events. The second column shows the differential expressed features (in SG format) in the alternative splicing events. The following 2n columns represent the n estimated expression levels of the DE features in the samples and the n indices of clusters from the Dirichlet mixture model, respectively. The p-values are in the second last column and FDR values in the last column.

Example:

Alternative splicing events DE features FPKM in samples Cluster indices p-values FDR
RGS6.8 N:S:0,J:S_72$74:0,N:72$74:0 1.3 1.5 0.9 … 0 1 1 1 1 … 0 0 0 0.00012 0.045
RGS6.9 N:72$74:0,J:72$74_78$79:0,N:78$79:0 1.3 1.5 0.9…1.5 1 1 1 1…1 1 1 0.9999 0.9999
Clone this wiki locally