This package contains the source code, pre-compiled executables and a few C++ library dependencies for LESSeq - Local Event-based analysis of alternative Splicing using RNA-seq.
R should be installed to run steps involving R scripts (http://cran.us.r-project.org/).
################# ###How to install
Users can download the required boost library from: http://archive.gersteinlab.org/boost/
The "LESSeq/bin" directory contains all executables and R scripts needed to run the pipeline. If the user chooses to use the pre-compiled executables, the only action needed is to add two lines to the .bashrc file: "export LD_LIBRARY_PATH=PATH_TO_PACKAGE/gsl/lib/:PATH_TO_PACKAGE/cppunit/lib/" and "export PATH=PATH_TO_PACKAGE/bin/:$PATH", where PATH_TO_PACKAGE is the absolute path to "LESSeq". There must be no space after "=" in either line.
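As a rough sketch, the two .bashrc lines for the pre-compiled setup could look like the following. The location "$HOME/LESSeq" is only an assumed example; substitute the absolute path where the package was actually unpacked.

```shell
# Lines to append to ~/.bashrc when using the pre-compiled executables.
# ASSUMPTION: the package was unpacked to $HOME/LESSeq; adjust as needed.
PATH_TO_PACKAGE="$HOME/LESSeq"

# Shared libraries bundled with the package (gsl and cppunit).
export LD_LIBRARY_PATH="$PATH_TO_PACKAGE/gsl/lib/:$PATH_TO_PACKAGE/cppunit/lib/"

# Make the LESSeq executables and R scripts callable by name.
export PATH="$PATH_TO_PACKAGE/bin/:$PATH"
```

As written in the instructions, this overwrites any existing LD_LIBRARY_PATH; append ":$LD_LIBRARY_PATH" to the first export if other library directories must be kept.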
If the user chooses to compile the source code, go to each of the "solve", "count" and "classify" directories and type "make" from the command line. Then move the compiled executables generated in the "solve/bin", "count/bin" and "classify/bin" directories to the "LESSeq/bin/" directory. Finally, in the .bashrc file, add "export LD_LIBRARY_PATH=PATH_TO_PACKAGE/gsl/lib/:PATH_TO_PACKAGE/cppunit/lib/:PATH_TO_PACKAGE/boost_1_34_1/boost/" and "export PATH=PATH_TO_PACKAGE/bin/:$PATH", where PATH_TO_PACKAGE is the absolute path to "LESSeq".
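The compile-and-move steps can be sketched as a small shell function. The name build_lesseq is hypothetical (it is not part of the package), and the sketch assumes that "make" and the required libraries are already available.

```shell
# Build the three LESSeq programs and collect the executables in LESSeq/bin/.
# ASSUMPTION: build_lesseq is a hypothetical helper name, not shipped with
# the package; "make" and the library dependencies must already be installed.
build_lesseq() {
    pkg_root="$1"    # absolute path to the "LESSeq" directory
    for prog in solve count classify; do
        # Run make inside each program directory; stop at the first failure.
        (cd "$pkg_root/$prog" && make) || return 1
        # Move whatever make placed in <prog>/bin into the shared bin directory.
        mv "$pkg_root/$prog"/bin/* "$pkg_root/bin/" || return 1
    done
}

# Usage (the path is an example only):
#   build_lesseq "$HOME/LESSeq"
```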
################# ###How to run
Below are the instructions for carrying out the four major steps described in the LESSeq paper.
- Refine gene models using RNA-Seq
Type "cufflinks" from the command line (the cufflinks executable is included in this package for this purpose, but can be replaced with other tools that the user prefers). The Cufflinks manual can be found at http://cufflinks.cbcb.umd.edu/ .
- Identify local events
Type "classify" from the command line, and the usage will be prompted. This program generates splicing graphs for each gene. Then use "Events.r" to generate local events from the splicing graphs.
Below are the parameters that need to be specified when using "classify", as well as "count" and "solve" in the next step (not every parameter is used by each of these programs):
log_level: determines how much information is printed to the screen while running; should be an integer value
proj_name: the name of the project, given by the user
out_prefix: the directory name for output files; the directories specified in this parameter should already exist, and the name must include "/" at the end
isoform_format: the annotation file format for each form of the local events (i.e. it specifies the coordinates of the different forms of local events); the current choice is LH_GENE_TXT (which is the same as the "interval" format defined at http://info.gersteinlab.org/RSEQtools#Interval)
isoforms_path: the path to the above annotation file
g2i_format: the format of the file for grouping local event forms (i.e. it indicates which local event forms belong to the same local event); the current choice is UCSC_GENE2ISOFORM (which is the same as the format of the "knownIsoforms.txt" files at http://info.gersteinlab.org/RSEQtools#mergeTranscripts)
g2i_path: the path to the above grouping file
gene_begin_idx: the index of the first local event to be analyzed
gene_end_idx: the index of the last local event to be analyzed
read_format: the format of the aligned reads; the current choice is MRF_SINGLE (single-end reads in MRF format, http://info.gersteinlab.org/RSEQtools#Mapped_Read_Format_.28MRF.29)
read_type: the type of reads; the current choice is SHORT_READ
expected_read_length: the average read length
reads_path: the path to the read alignment file
total_read_bases: the total number of bases in the alignment file
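Because out_prefix must name an existing directory and end with "/", a quick check before launching a run can catch mistakes early. The sketch below is only an illustration; check_out_prefix is a hypothetical helper, not part of the package.

```shell
# Verify an out_prefix value before passing it to classify, count or solve.
# ASSUMPTION: check_out_prefix is a hypothetical helper name.
check_out_prefix() {
    # The value must end with "/".
    case "$1" in
        */) ;;
        *) echo "out_prefix must end with '/': $1" >&2; return 1 ;;
    esac
    # The directory must already exist.
    if [ ! -d "$1" ]; then
        echo "out_prefix directory does not exist: $1" >&2
        return 1
    fi
}

# Usage (the directory name is an example only):
#   check_out_prefix "results/run1/" || mkdir -p "results/run1/"
```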
We provide two executables (in the "LESSeq/bin" directory) for generating the two file formats needed by these programs from GTF/GFF files:
"parseGencode" converts a GTF/GFF file generated by cufflinks into the LH_GENE_TXT (or "interval") annotation format specified above, and its usage is "cat GTF/GFF_INPUT_FILE_NAME | parseGencode > OUTPUT_FILE_NAME.interval"
"gencodeIsoformMap" generates the UCSC_GENE2ISOFORM grouping format specified above, and its usage is "cut -f1 OUTPUT_FILE_NAME.interval | gencodeIsoformMap > OUTPUT_FILE_NAME.map"
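The two conversions can be chained in one small function, sketched below. The name gtf_to_lesseq_inputs is hypothetical; the sketch assumes that parseGencode and gencodeIsoformMap are on PATH.

```shell
# Convert a cufflinks GTF/GFF file into the two inputs described above.
# ASSUMPTION: gtf_to_lesseq_inputs is a hypothetical helper name;
# parseGencode and gencodeIsoformMap (from LESSeq/bin) must be on PATH.
gtf_to_lesseq_inputs() {
    gtf="$1"       # GTF/GFF file produced by cufflinks
    prefix="$2"    # output file name without extension

    # LH_GENE_TXT ("interval") annotation file. Redirecting the file to
    # standard input is equivalent to "cat FILE | parseGencode".
    parseGencode < "$gtf" > "$prefix.interval" || return 1

    # UCSC_GENE2ISOFORM grouping file, built from the first column.
    cut -f1 "$prefix.interval" | gencodeIsoformMap > "$prefix.map"
}

# Usage (file names are examples only):
#   gtf_to_lesseq_inputs transcripts.gtf transcripts
```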
The output of this step is 8 sets of annotation files, one for each of the 8 local event types specified in the paper - skipped exon (SE), retained intron (RI), alternative 5’ splice site (A5SS), alternative 3’ splice site (A3SS), mutually exclusive exon (MXE), alternative first exon (AFE), alternative last exon (ALE) and tandem 3’ UTRs (T3). For each event type, two files are generated - one LH_GENE_TXT (or "interval") format file with the annotation information of local event forms, and one UCSC_GENE2ISOFORM format file with the grouping information of local event forms.
- Count reads compatible with alternative forms of local events, and estimate their relative expression levels
Type "count" from the command line to get raw counts of reads compatible with alternative forms of local events, and type "solve" to get estimated relative expression levels. The usage will be prompted for each program. The read counts and relative expression levels are printed as flat files on standard output, and can be redirected to a file with ">".
The output file from the "count" program is specified below:
column1: grouping ID of local events
column2: total number of reads mapped to a local event
column3: ID of a specific form of a local event
column4: number of reads compatible with the specific form of local event
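As an illustration of how these columns can be used, the sketch below prints, for each form, the raw proportion of the event's reads that are compatible with it (column4 / column2). This is a plain ratio, not the estimate produced by "solve". The function name count_fractions is hypothetical, and the sketch assumes that the saved output is whitespace-separated.

```shell
# Print: grouping ID, form ID, and column4/column2 for each line.
# ASSUMPTIONS: count_fractions is a hypothetical helper name; the input is
# the saved standard output of "count" with whitespace-separated columns.
count_fractions() {
    # "$2 + 0 > 0" forces a numeric comparison, so non-numeric lines and
    # events with zero mapped reads are skipped (no division by zero).
    awk '$2 + 0 > 0 { printf "%s\t%s\t%.3f\n", $1, $3, $4 / $2 }' "$1"
}

# Usage (the file name is an example only):
#   count_fractions SE.counts.txt
```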
The output file from the "solve" program is specified below:
column1: grouping ID of local events
column2: total number of reads mapped to a local event
column3: ID of a specific form of a local event
column4: relative expression level of the specific form of local event
column5: RPKM value of the specific form of local event
column6: log likelihood statistic of the estimation
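A saved "solve" output can be filtered on any of these columns with standard tools. The sketch below keeps only the forms whose RPKM (column5) reaches a user-chosen minimum. The name filter_by_rpkm and the threshold are hypothetical, and whitespace-separated columns are assumed.

```shell
# Keep lines of a saved "solve" output whose RPKM (column 5) is >= a minimum.
# ASSUMPTIONS: filter_by_rpkm is a hypothetical helper name; the columns are
# whitespace-separated as listed above.
filter_by_rpkm() {
    # Adding 0 forces both sides to be compared as numbers.
    awk -v min="$2" '$5 + 0 >= min + 0' "$1"
}

# Usage (file name and threshold are examples only):
#   filter_by_rpkm SE.solve.txt 1 > SE.solve.filtered.txt
```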
- Test differential alternative splicing
Several R packages ("multtest", "epicalc", "lmtest", "xtable", "MASS") are required to run the log-linear model method in the "Test_AS.r" script. Follow the instructions in "Test_AS.r" to conduct the relevant statistical tests.
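Before running "Test_AS.r", it can help to confirm that the listed R packages are installed. The sketch below is one possible check; check_test_as_packages is a hypothetical name, and the sketch assumes that Rscript is on PATH.

```shell
# Report which of the R packages needed by Test_AS.r are not installed.
# ASSUMPTIONS: check_test_as_packages is a hypothetical helper name, and
# Rscript (installed together with R) is on PATH.
check_test_as_packages() {
    # Each -e argument is one R expression; they are evaluated in order.
    Rscript \
        -e 'pkgs <- c("multtest", "epicalc", "lmtest", "xtable", "MASS")' \
        -e 'absent <- pkgs[!pkgs %in% rownames(installed.packages())]' \
        -e 'if (length(absent) > 0) { cat("Missing R packages:", absent, "\n"); quit(status = 1) }' \
        -e 'cat("All required R packages are installed\n")'
}

# Usage:
#   check_test_as_packages || echo "install the missing packages first"
```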