ewyang089 edited this page Mar 15, 2017 · 26 revisions

Welcome to the SDEAP wiki!

Differential expression analysis without predefined biological conditions is critical to biological and clinical studies on population data. For example, it can be used to discover biomarkers that classify cancer samples into subtypes, so that better diagnosis and therapy methods can be developed for each subtype, or that classify individual cells into cell types.

Although several methods have been introduced recently for performing such differential expression analysis at the gene level, there is no method in the current literature for performing the analysis at the transcript level (i.e., differential transcript expression or DTE analysis). This document describes the installation and usage of the first DTE analysis method for population RNA-Seq data without predefined biological conditions, called SDEAP.

Installation

This software was developed on Linux and has been tested with g++ 4.4.5, Python 2.7.10, and R 3.2.3. To install it, perform the following steps:

  1. Install scikit-learn (http://scikit-learn.org/stable/).

    SDEAP has been tested with scikit-learn-0.18.

  2. Install the following R packages.

    ggplot2

  3. Download SDEAP 0.99

    • From Github: git clone https://github.com/ewyang089/SDEAP.git

Usage

SDEAP is a differential transcript/gene expression analysis tool that identifies informative expression features across biological samples. Here, the expression features can be read counts (the numbers of reads mapped to genes/exons/expressed segments) or the FPKMs of full-length/partial transcripts, as described in the paper. To use SDEAP for differential expression analysis on population RNA-Seq data, the user needs to (1) convert the input RNA-Seq samples to a table of normalized expression features, where every row represents an expression feature and every column is the expression profile of a sample, and then (2) run the SDEAP pipeline on the input table to find differentially expressed features. In the input table, the sample ids are given in the first row and the names of the expression features are listed in the first column. An example input table, “normalized_mtx.txt”, from single-cell Drop-Seq experiments is provided with the source code. The following tutorial goes through each step using this example file and demonstrates how to identify the cell types of the cells from the results predicted by SDEAP.
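
The table layout described above can be checked before running the pipeline. The following is a minimal Python 3 sketch, not part of SDEAP; it assumes a tab-delimited file whose header row starts with a corner cell, and the function name `read_expression_table` and the toy gene and cell names are hypothetical.

```python
import csv
import io


def read_expression_table(handle, delimiter="\t"):
    """Parse a feature-by-sample table.

    Assumed layout (not confirmed by the SDEAP source): the first row holds
    the sample ids after a corner cell, and the first column of every other
    row holds the feature name.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    sample_ids = header[1:]  # drop the corner cell
    features = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        name = row[0]
        values = [float(v) for v in row[1:]]
        if len(values) != len(sample_ids):
            raise ValueError(
                "row %r has %d values, expected %d"
                % (name, len(values), len(sample_ids))
            )
        features[name] = values
    return sample_ids, features


# Toy table with hypothetical feature and sample names.
toy = (
    "feature\tcell_1\tcell_2\tcell_3\n"
    "geneA\t0.0\t1.5\t3.0\n"
    "geneB\t2.0\t2.0\t2.5\n"
)
sample_ids, features = read_expression_table(io.StringIO(toy))
print(sample_ids)
print(features["geneA"])
```

To check a real file, pass an open file handle instead of the `io.StringIO` object and adjust the delimiter if the table is not tab-separated.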

1. Selecting informative expression features:

In this step, informative features (rows) in the input table are selected by the regression model explained in our paper. There are two user-defined parameters: the minimal fold-change of the observed CV2 value (the squared coefficient of variation) over the expected CV2 value, and the minimal mean value of an informative feature. The input file is the table of expression features, and users must specify the name of the output file, which will contain the informative features. The regression result is also plotted in an output PDF file, Rplot.pdf, as in Fig. 1, where every data point is an expression feature, the x-axis is the log2(mean) value, the y-axis is the observed CV2 value, and the red regression line shows the expected CV2 values.

$ Rscript IF.r <input_file> <output_file> <fold_change_CV2> <min_mean>

<input_file> = the name of the input file

<output_file> = the name of the output file

<fold_change_CV2> = the cut-off value of the CV2 fold-change ratio

<min_mean> = the minimum mean value for a feature to be considered informative

Ex:

$ Rscript IF.r normalized_mtx.txt if.txt 2 0.5

Fig.1 The observed and expected CV2 values over the mean values
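
The two thresholds in this step can be illustrated with a short Python 3 sketch. It is not the code in IF.r: the regression model from the paper is not reproduced here, and the expected CV2 curve is a stand-in (1/mean, the value expected under pure Poisson noise) supplied by the caller. The function names and toy data are hypothetical.

```python
def mean_and_cv2(values):
    """Return the mean and the squared coefficient of variation of a row."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return mean, variance / mean ** 2


def select_informative(features, expected_cv2, fold_change, min_mean):
    """Keep features whose observed CV2 is at least fold_change times the
    expected CV2 and whose mean is at least min_mean."""
    selected = []
    for name, values in features.items():
        if sum(values) <= 0:
            continue  # an all-zero row has no defined CV2
        mean, cv2 = mean_and_cv2(values)
        if mean >= min_mean and cv2 >= fold_change * expected_cv2(mean):
            selected.append(name)
    return sorted(selected)


# Stand-in for the fitted regression line (an assumption, not SDEAP's model).
def poisson_cv2(mean):
    return 1.0 / mean


toy_features = {
    "geneA": [0.0, 0.0, 8.0, 8.0],  # high mean, highly variable
    "geneB": [3.0, 4.0, 5.0, 4.0],  # high mean, little variation
    "geneC": [0.0, 0.0, 0.0, 1.0],  # mean below the min_mean cut-off
}
selected = select_informative(toy_features, poisson_cv2, fold_change=2, min_mean=0.5)
print(selected)
```

With the thresholds from the example command (2 and 0.5), only the row that is both sufficiently expressed and more variable than the stand-in curve predicts is kept.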

2. Clustering by the Dirichlet infinite mixture model

The instances of each informative feature are clustered by the Dirichlet infinite mixture model (DPMM_new.py), as described in the paper. The input file contains the informative features from the previous step, and the output file contains a cluster label for each instance; instances with the same label belong to the same cluster. Run the Python script with the following command line.

$ python DPMM_new.py <informative_features>

<informative_features> = the file name of the informative features

Ex:

$ python DPMM_new.py if.txt

In this case, the cluster labels are written to the output file “if.txt.cluster” in the same directory as the input file.
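
To give an idea of what clustering the instances of a single feature looks like, here is a minimal Python 3 sketch using scikit-learn. The source does not say which estimator DPMM_new.py uses, so the choice of `BayesianGaussianMixture` with a Dirichlet process prior is an assumption, as are the helper name `cluster_feature`, the truncation level of 5 components, and the toy values.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def cluster_feature(values, max_components=5, seed=0):
    """Cluster the instances of one feature with a truncated Dirichlet
    process Gaussian mixture (an assumed stand-in for DPMM_new.py)."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)  # one column, one row per sample
    model = BayesianGaussianMixture(
        n_components=max_components,  # upper bound on the number of clusters
        weight_concentration_prior_type="dirichlet_process",
        random_state=seed,
    )
    return model.fit(X).predict(X)


# Hypothetical expression values of one feature across ten samples.
values = [0.10, 0.20, 0.00, 0.15, 0.05, 9.8, 10.1, 10.0, 9.9, 10.2]
labels = cluster_feature(values)
print(list(labels))
```

The model is given an upper bound on the number of components and is free to leave some of them unused, which is why the number of clusters does not have to be fixed in advance.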

3. Testing the significance of the difference between clusters of instances by ANOVA tests

Based on the clustering results, the statistical significance of the difference between the clusters is measured by the ANOVA test (ANOVA.r). This step takes two input files, <informative_features> and <cluster_labels>, and one user-defined parameter: a cut-off on the false discovery rate (FDR), which is used to filter out every informative feature with no significant difference between its clusters of instances. The R script writes two output files. The first, <informative_features>.heatmap, contains the informative features that pass the user-defined FDR cut-off; it can be used as the input of the hierarchical clustering in Step 4 to visualize the biological conditions of the samples. The second, <informative_features>.pvalues, lists all the instances of all the informative features, with the p-values and FDR values in the last two columns.

$ Rscript ANOVA.r <informative_features> <cluster_labels> <min_FDR>

<informative_features> = the selected expression features from Step 1.

<cluster_labels> = the cluster indices of the instances generated in Step 2.

<min_FDR> = the FDR cut-off value

Ex:

$ Rscript ANOVA.r if.txt if.txt.cluster 0.1
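
The logic of this step can be sketched in Python 3 as follows. This is an illustration rather than the code in ANOVA.r: it computes the one-way ANOVA F statistic of one feature grouped by cluster label, and it assumes the FDR values are Benjamini-Hochberg adjusted p-values, which the source does not state. Converting the F statistic to a p-value needs the F distribution and is not shown; the p-values below are made up for the example.

```python
def f_statistic(values, labels):
    """One-way ANOVA F statistic of one feature, grouped by cluster label.
    Requires at least two clusters and non-zero within-cluster variation."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    return (between / (k - 1)) / (within / (n - k))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


f_value = f_statistic([1.0, 2.0, 3.0, 7.0, 8.0, 9.0], [0, 0, 0, 1, 1, 1])
print(f_value)

# Hypothetical p-values for four features, filtered at an FDR cut-off of 0.1.
pvalues = [0.01, 0.04, 0.03, 0.5]
fdr = benjamini_hochberg(pvalues)
kept = [i for i, q in enumerate(fdr) if q <= 0.1]
print(fdr, kept)
```

Features whose adjusted value exceeds the cut-off are the ones that would be dropped before the heatmap step.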

4. (Optional) Hierarchical clustering of samples and informative features:

Here, we provide a script to bi-cluster and visualize the expression profiles of the samples. The script requires the R library made4 (http://www.bioconductor.org/packages//2.7/bioc/html/made4.html). Its usage is as follows.

$ Rscript heatmap.r <dge_heatmap>

<dge_heatmap> = the filtered informative features from Step 3.

Ex:

$ Rscript heatmap.r if.txt.heatmap

A heatmap and the dendrograms of the hierarchical clustering of samples and expression features are written to an output PDF file (Rplot.pdf). In our example, the cell types of the samples are clearly identified, as shown in Fig. 2.

Fig. 2 The heatmap and dendrograms of the input samples
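
The four steps above can be chained from one small driver script. The following Python 3 sketch only assembles the command lines given in this tutorial and prints them; the helper name `build_pipeline` is hypothetical, and the file-name suffixes (.cluster, .heatmap) are the ones described in Steps 2 and 3.

```python
def build_pipeline(input_table, features_file, fold_change, min_mean, fdr_cutoff):
    """Return the argument lists for Steps 1 to 4, in order."""
    cluster_file = features_file + ".cluster"  # written by Step 2
    heatmap_file = features_file + ".heatmap"  # written by Step 3
    return [
        ["Rscript", "IF.r", input_table, features_file,
         str(fold_change), str(min_mean)],
        ["python", "DPMM_new.py", features_file],
        ["Rscript", "ANOVA.r", features_file, cluster_file, str(fdr_cutoff)],
        ["Rscript", "heatmap.r", heatmap_file],
    ]


commands = build_pipeline("normalized_mtx.txt", "if.txt", 2, 0.5, 0.1)
for argv in commands:
    print(" ".join(argv))
    # To actually run a step from the SDEAP directory, something like
    # subprocess.run(argv, check=True) could be used here.
```

Since both Step 1 and Step 4 are described as writing Rplot.pdf, it may be worth renaming the first plot before running Step 4 in the same directory.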