Skip to content
ewyang089 edited this page Mar 15, 2017 · 26 revisions

Welcome to the SDEAP wiki!

Differential expression analysis without predefined biological conditions is critical to biological and clinical studies on population data. For example, it can be used to discover biomarkers to classify cancer samples into subtypes such that better diagnosis and therapy methods can be developed for each subtype, or individual cells into cell types.

Although several methods have been introduced recently for performing such differential expression analysis at the gene level, there is no method in the current literature for performing the analysis at the transcript level (i.e., differential transcript expression or DTE analysis). This document describes the installation and usage of the first DTE analysis method for population RNA-Seq data without predefined biological conditions, called SDEAP.

Table of contents

Installation

This software was developed on Linux and it has been tested with g++ 4.4.5. To install it, perform the following steps:

  1. Install scikit-learn (http://scikit-learn.org/stable/). SDEAP has been tested with scikit-learn-0.16.1.
  2. Install R.
  3. Download SDEAP 0.1
  4. Extract the tar.gz archive: $ tar zfxv sdeap_0.*.tar.gz
  5. After extracting the tar.gz file, users will find the two subdirectories EXPR and MERGE. Please, step into each directory and execute:
  $ make clean
  $ make

Afterwards, the EXPR and MERGE binary executable files will be created in the EXPR and MERGE directories, respectively.

Usage

SDEAP is a differential transcript/gene expression analysis tools which identify informative expression features across the given samples. The expression feature can be read counts (the numbers of reads mapped to genes/exons), or the FPKMs of full-length/partial transcripts as described in the paper. To use SDEAP to conduct differential transcript expression (DTE) analysis on population RNA-Seq data, the user needs to (1) convert the input RNA-Seq samples to a table of normalized expression features, where every row represents an expression feature and every column is the expression profile of a sample, and then (2) run the pipeline of SDEAP on the input table for differential expressed features. In the input table, the ids of samples are given in the first row and the name of the expression features are listed in the first column. Every column (other than the first one) is the expression profile of samples and every row (other than the first one) represents the normalized numbers of reads mapped to a gene among all the samples. An example of the input table “normalized_mtx.txt” from single-cell Drop-Seq experiments is provided with the source codes. In the following tutorial, we will go through each step using this example file and demonstrate how we identify cell types using the predicted results by SDEAP.

  1. Selecting informative expression features:

In this step, informative features (rows) in the input table are selected by the regression model as explained in our paper. There are two user-defined parameters. The first is the minimal fold-change rate of the observed CV2 value over the expected CV2 and the second is the minimal mean value of an informative feature. The input file is the given table of expression features and users must specify the name of the output file which contains the informative features. At the same time, the result of the regression based on the input instances of expression features are illustrated in an output PDF file, Rplot.pdf, as in Fig.1, where every data point is an expression feature, the X-axis is for the log2(mean), the y-axis for the CV2 values, and the red regression line shows the expected CV2 values.

$Rscript IF.r <input_file> <output_file> <Fold_chage_CV2>

<input_file> = the name of the input file <output_file> = the name of the output file <Fold_chage_CV2> = the cut-off value of the CV2 fold-change ratio = the minimum of the mean values to be considered as an informative feature

Ex: $Rscript IF.r normalized_mtx.txt if.txt 2 0.5

Fig.1 The observed and expected CV2 values over mean values

Convert RNA-Seq samples to expression features for individual genes (EXPR)

In the SDEAP software package, a utility called EXPR takes the (known or predicted) gene models of the reference genome and the mapped RNA-Seq reads as the input and generates the splice graphs for all genes as the output. It accepts annotated gene models in GTF format and read alignments in the popular BED format. Note that the input data of SDEAP can be splice graphs provided by users directly as long as they are represented in the SG file format. Please, use EXPR_SG to extract expression features from the given splice graphs. The reads mapped to every gene should be save as an individual BED file and the BED files of each RNA-Seq sample should be kept in a separate directory. The name of the BED files for every gene should be the same in the directories. For example, the BED file of the gene ABCD3 should be ABCD3 as well. The annotation of every gene should also be saved as an individual GTF file whose name is the name of the gene. The users of EXPR also need to provide a list of the total numbers of reads in every RNA-Seq samples in millions (saved in a txt file, one line for one library as given in ms_sdeap_lib.txt for an example), a list of the read length in the input RNA-Seq samples (as in MS_len.txt), and a list of genes to be predicted (also saved in a txt file, one line for one gene as in ms_glist.txt) to EXPR. To run EXPR, a command file (e.g., CMD.txt) which specifies the paths to the input, output, and the required lists should be given. Because the extraction of expression features is time-consuming, we highly recommend users to run EXPR in parallel. The total number of computational threads is specified by proc_num in the command file. Then, SDEAP can be invoked by the following command:

$ ./EXPR <the index of threads> <a command file>

An example of this command running in 4 threads:

$ ./ EXPR  0  CMD.txt
$ ./ EXPR  1  CMD.txt
$ ./ EXPR  2  CMD.txt
$ ./ EXPR  3  CMD.txt

After running EXPR (or EXPR_SG), for every gene, there is a profile of the expression features created to summarize the expression features of the gene among the given RNA-Seq samples in the output directory which is specified in the input command file.

Predict differential expressed genes using the profiles of expression features (MERGE)

The MERGE command requires the path to the file containing the profiles of expression features (e.g., CMD_MERGE.txt). In the command, the users of SDEAP should also provide two parameters for filtering. The first parameter is the minimum of fold change rate to the background noise and the second is the maximum of background noise as described in the manuscript. MERGE can be invoked with the following command:

$ ./MERGE < a command file>

For example, $ ./ MERGE CMD_MERGE.txt

#Output

SDEAP will create a filtered_pred.txt file to summarize all differentially expressed (DE) features of the predicted DTE genes. Suppose that there are n RNA-Seq samples in the input data. The filtered_pred.txt file contains 2n + 4 columns.

The first column shows the predicted alternative splicing events. The second column shows the differential expressed features (in SG format) in the alternative splicing events. The following 2n columns represent the n estimated expression levels of the DE features in the samples and the n indices of clusters from the Dirichlet mixture model, respectively. The p-values are in the second last column and FDR values in the last column.

Example:

Alternative splicing events DE features FPKM in samples Cluster indices p-values FDR
RGS6.8 N:S:0,J:S_72$74:0,N:72$74:0 1.3 1.5 0.9 … 0 1 1 1 1 … 0 0 0 0.00012 0.045
RGS6.9 N:72$74:0,J:72$74_78$79:0,N:78$79:0 1.3 1.5 0.9…1.5 1 1 1 1…1 1 1 0.9999 0.9999
Clone this wiki locally