GitHub - bdgp/grit

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 1,379 Commits
analyze		analyze
bin		bin
docs		docs
grit		grit
simulator		simulator
tests		tests
.gitignore		.gitignore
COPYING		COPYING
MANIFEST.in		MANIFEST.in
README		README
TODO		TODO
setup.py		setup.py
Repository files navigation

GRIT - A tool for the integrative analysis of RNA-seq type assays. 

################################################################################
Install: 
################################################################################

Build Dependencies:
  c compiler
  python headers
  cython

Runtime Dependencies:
  scipy 
  networkx 
  pysam

In a debian based system, running 

(sudo) apt-get install gcc python-dev cython python-scipy python-networkx
and then 
(sudo) easy_install pysam

should install all of the dependencies. 

INSTALLATION:

-Method 1: easy_install

Run: (sudo) easy_install GRIT-2.0.1.tar.gz

This should install the dependencies, the grit python module, and the grit 
script (run_grit.py). However, installing them through your distributions 
package manager is preferred whenever possible.

-Method 2: setup.py

1) unzip the package (tar -zxvf GRIT-VERSION.tar.gz)
2) move into the unzipped directory (cd GRIT-VERSION/)
3) run the setup script ((sudo) python setup.py install)

-Method 3: packages

Making debian, ubuntu, and redhat packages is on my TODO, but I haven't. 
If these would be useful, please feel free to send me an email and I'll
move it up the priority list.



################################################################################
Tutorial: 
################################################################################

There is a tutorial, with data, available on grit-bio.org.

################################################################################
Introduction:
################################################################################

GRIT is designed to use RNA-seq, TES (e.g. poly(A) seq), and TSS (
e.g. CAGE, RAMPAGE) data to build and quantify full length transcript
models. When all of these data sources are not available, GRIT can be 
run by providing a candidate set of TES or TSS sites. In addition, GRIT
can merge in reference junctions, and gene boundaries. GRIT can also
be run in quantification mode, where it uses a provided GTF file and 
just estimates transcript expression.

In addition, GRIT only works with stranded and paired RNA-seq data. 
Although it could be modified to work with either, we feel that these
data sources are becoming obsolute.


################################################################################
Simple Example:
################################################################################

(to try this, download and unzip 
    http://grit-bio.org/GRIT_example.tar.gz ) 

The simplest possible GRIT run is:

run_grit --rnaseq-reads AdMatedF_Ecl_20days_Heads.biorep1.rnaseq.chr4.bam  \
            --cage-reads AdMatedF_Ecl_20days_Heads.biorep1.cage.chr4.bam \
            --polya-reads AdMatedF_Ecl_20days_Heads.biorep2.passeq.chr4.bam \
            --reference flybase-r5.45.chr4.gtf

Note that the reference is required to determine the read strands. The following
would produce identical results:

run_grit --rnaseq-reads AdMatedF_Ecl_20days_Heads.biorep1.rnaseq.chr4.bam  \
            --rnaseq-read-type backward \
            --cage-reads AdMatedF_Ecl_20days_Heads.biorep1.cage.chr4.bam \
            --cage-read-type backward \
            --polya-reads AdMatedF_Ecl_20days_Heads.biorep2.passeq.chr4.bam \
            --polya-read-type forward

It is only possible to use a single bam for each data type when using command 
line options, and there is no notion of a replicate. However, GRIT also accepts
a control file, which allows for substantial flexibility.

Output:

This will output 3 data files:

discovered.elements.bed
discovered.transcripts.gtf
discovered.expression.csv

which contain the transcript elements (e..g exons and promoters), the 
transcripts, and transcript level expression estimates. 

Additional options:

Typically, one would run the above with:
--ucsc    : formats the .bed and .gtf files so that they can be loaded into the
            ucsc genome browser
--threads : specifies to use this many concurrent processes
--fasta   : a fasta file, which allows GRIT to run an ORF finder on the 
            discovered transcripts

In addition, for samples with relatively low read coverage, it can be useful to 
include reference elements. Our experience is that junctions are relatively well
annotated, so using --reference with --use-reference-junctions can help to 
improve gene connectivity, and because the reference junctions are assigned 
a count of zero during quantification, won't bias the expression estimates.

We would not recommend using reference TSS and TES's unless there are no other 
options. We have found them to be of much lower quality than other transcript
elements. Instead, looking for publically avbailable CAGE and poly(A)-site-seq
data in a matching tissue or cell line seems to perform better (in human, FANTOM
and Merck poly(A) data are good resources. In fly, modENCODE has all the data 
types).

################################################################################
Control File Example:
################################################################################

Instead of the above commands, we could also run:

run_grit.py --control AdMatedF_Ecl_20days_Heads.control.txt

Here's an example control file.

# comment line
# *'s indicate merged
#sample_type            rep_id  assay   paired  stranded  read_type  filename

AdMatedF_20days_Heads   rep1    rnaseq  true    true      auto       AdMatedF_Ecl_20days_Heads.biorep1.rnaseq.chr4.bam 
AdMatedF_20days_Heads   rep2    rnaseq  true    true      auto       AdMatedF_Ecl_20days_Heads.biorep2.rnaseq.chr4.bam
AdMatedF_20days_Heads   *       cage    false   true      auto       AdMatedF_Ecl_20days_Heads.biorep1.cage.chr4.bam
AdMatedF_20days_Heads   *       polya   false   true      auto       AdMatedF_Ecl_20days_Heads.biorep2.passeq.chr4.bam

sample_type - a unique identifier for the biological sample type. 
rep_id      - a unique identifier for the replicate

Data from the same sample type is merged for building elements (e.g. junction
discovery, exon building, etc.) but transcripts are quantified on individual 
replicates. A '*' as a rep_id indicates that the data should be used for all
replicates for that sample type. 

If multiple bam files with the same sample_type and rep_id are provided, they 
will be merged. So, for instance, if you have an RNAseq experiment where one 
has a read_type of forward, and one has a read_type of backward, then they 
should each be provided a line with different read types.

Again, in practice we would probably run:

run_grit --control AdMatedF_Ecl_20days_Heads.control.txt --verbose --ucsc \
            -t 16 --reference flybase-r5.45.chr4.gtf --use-refernce-junctions \
            --fasta dm_rel_5.fa

################################################################################
Peak Calling Example:
################################################################################
In addition to discovering and building transcript models, GRIT can be used to 
identify TSS/TES from RAMPAGE/CAGE/PASseq data and an RNA-seq control. The 
ENCODE consortium uses the call_peaks script distributed with GRIT v2.0.4 
to identify RAMPAGE peaks. Installation is identical to isntalling the full GRIT
software pacakge, as described above.

To try calling peaks using the ENCODE RNA evaluation data (XXX) 
for download a position sorted and indexed RAMPAGE file (eg XXX), a matching 
RNAseq sorted and indexed bam file (eg XXX), and a reference annotation in GTF
format (eg XXX). To download test data, in a suitable working directory, run:
wget -r -nd --no-parent -A '*.bam*,*.gtf' \
     http://mitra.stanford.edu/kundaje/nboley/grit-bio.org/ENCODE_RAMPAGE_TEST/;
The exact commands for a linux machine with GRIT installed can be found below.

Then run:
call_peaks --rampage-reads $RAMPAGEReads.sorted.bam \
           --rnaseq-reads $RNASeqReads.sorted.bam \
           --reference $referenceAnnotation.gtf \
           --exp-filter-fraction 0.05 --trim-fraction 0.01 \
           --ucsc \
           --outfname peaks.gff --outfname-type gff \
           --bed-peaks-ofname peaks.bed \
           --threads $nThreads 

Here we describe the individual options used above - the full set of options and
descriptions can be found by running call_peaks --help. 

--rampage-reads: a position sorted, indexed bam file containing rampage reads to
                 identify peaks from. 
                 
                 The ENCODE consoritum uses STAR to map the RAMPAGE reads, 
                 PCR remove duplicates, identify junctions, etc. Details of the
                 mapping procedure can be found here: 
                     https://github.com/ENCODE-DCC/long-rna-seq-pipeline
                 
                 If a reference annotation is not provided, then the user must 
                 set the --rampage-read-type option to forward or backward. If
                 this option is set to forward, then if the first read pair 
                 maps to the reference genome without being reverse complemented
                 then the read is assumed to have come from a + stranded gene. 
                 
--rnaseq-reads:  a position sorted, indexed bam file containing rnaseq reads 
                 from the same sample as the --rampage-reads. The RNA-seq reads
                 help GRIT distinguish noise in a similar way to how a DNA seq
                 experiment is used to control for a ChIP-seq experiment. GRIT
                 currently requires RNAseq reads generated using a stranded
                 protocol. 
 
                 The ENCODE consoritum uses STAR to map the RNAseq reads; 
                 details of the ENCODE mapping pipeline can be found here: 
                     https://github.com/ENCODE-DCC/long-rna-seq-pipeline
                 
                 If a reference annotation is not provided, then the user must 
                 set the --rnaseq-read-type option to forward or backward. If
                 this option is set to forward, then if the first read pair 
                 maps to the reference genome without being reverse complemented
                 then the read is assumed to have come from a + stranded gene. 

--reference:     a GTF file containing reference transcripts. GRIT does not 
                 require a reference annotation, but providing one helps 
                 improve the quality of the result by providing connectivity 
                 between transcribed regions. A reference annotation is also 
                 required to automatically infer read strand.  

--exp-filter-fraction: peaks with low expression levels *relative to the gene
                 region in which they lie* will be filtered regardless of their
                 estimated statistical enrichment. This helps the results to be
                 robust to differences between the RAMPAGE and RNAseq 
                 experiments. The ENCODE consortium sets this value to 0.05 
                 which means that a peak will be removed if there is another 
                 peak in the same gene with an expression value >= 20x higher. 
                 5% is relatively conservative - those primarily focused on new
                 discovery that have properly matched RAMPAGE/RNAseq experiments
                 should feel comfortable lowering this threshold to 1%. If there
                 is not a control RNASeq experiments from a matching biological 
                 sample then values as high as 25% may be appropriate. 

--trim-fraction: Specify how much to trim the edge of peaks. ENCODE uses a value
                 of 0.01, which specifies that each boundary will be trimmed 
                 while the trimming has removed less than 1% of the total reads
                 that fall into the peak. 

--ucsc:          Use contig names compatible with the ucsc browser (this 
                 typically just involves prepending a chr to the contig names)
               
--outfname:      The name of the output file.
--outfname-type: The type of the output file. We recommend using the gff output
                 whenever possible as future updates will include additional 
                 meta data which will likely break bed backwards compatibility. 

--bed-peaks-ofname: Write an additional output file in bed format. This is 
                 primarily used for downstream analysis tools that require bed
                 input. 
 
--threads:       The number of processes to run concurrently.

--verbose:       Open a ncurses interface which allows each stage to be 
                 monitored.

## Ouput Formats ###############################################################

gff:             
gff output is standard gff output with the following id files:

gene_id:         The  ID of the GRIT determined gene region this peaks falls in.
gene_name:       Currenty identical to gene_id - in the future this will contain
                 the name of the nearest matching reference gene.
tss_id:          A unique identifier for this TSS id. 
peak_cov:        The RAMPAGE read counts falling within this peak. This is 
                 useful for identifying peak summits, identifying peak 
                 enrichment, identifying binding motifs associated with this
                 peak, etc.  

bed:
The bed output is a standard bed6 with the following additional fields:
read countq    : int
gene_id        : string
gene_name      : string
tss_id         : string
peak_coverage  : comma separated floats specifying the observed read coverage 
                 across the peak 

## Running IDR on peak replicates ##############################################

The ENCODE consortium uses the IDR software package (
https://github.com/nboley/idr/) to identify peaks which are reproducible between
replicates. Given the bed output from two peak calling runs (e.g. PEAK1.bed and 
PEAK2.bed) by running:

idr --samples REP1.bed REP2.bed     # run IDR on the REP1.bed and REP2.bed
    --input-file-type bed --rank 7  # specifies that the input is a bed format, 
                                    # and the score track is in column 7
    --output-file IDR.peaks.txt     # output the peak list and reproducibility 
                                    # information to IDR.peaks.txt
    --plot                          # produce a QC plot, IDR.peaks.txt.png

Details about command line options and a description of the output files can be
found at https://github.com/nboley/idr/.

## Example commands for ENCODE test data #######################################

For a linux machine with wget, GRIT and IDR installed, in a suitable working
directory run:

wget -r -nd --no-parent -A '*.bam*,*.gtf' \
     http://mitra.stanford.edu/kundaje/nboley/grit-bio.org/ENCODE_RAMPAGE_TEST/;

call_peaks --rampage-reads ENCFF001RDX-ENCFF001RDR_rampage_star_marked.chr20.bam \
           --rnaseq-reads ENCFF001RFD-ENCFF001RFC_star_genome.chr20.bam \
           --reference gencode.v19.annotation.chr20.gtf  \
           --exp-filter-fraction 0.05 --trim-fraction 0.01 \
           --ucsc \
           --outfname peaks.rep1.gff --outfname-type gff \
           --bed-peaks-ofname peaks.rep1.bed \
           --threads 24;

call_peaks --rampage-reads ENCFF001RET-ENCFF001RES_rampage_star_marked.chr20.bam \
           --rnaseq-reads ENCFF001RFF-ENCFF001RFE_star_genome.chr20.bam \
           --reference gencode.v19.annotation.chr20.gtf \
           --exp-filter-fraction 0.05 --trim-fraction 0.01 \
           --ucsc \
           --outfname peaks.rep2.gff --outfname-type gff \
           --bed-peaks-ofname peaks.rep2.bed \
           --threads 24; 

idr        --samples peaks.rep1.bed peaks.rep2.bed \
           --input-file-type bed --rank 7 \
           --output-file peaks.IDR.txt --plot;

################################################################################
Command Line Options:
################################################################################

Run 'run_grit --help' for a complete list. They should be well described.

################################################################################
Questions/bug reports:
################################################################################

Please direct questions and bug reports to the mailing list (
http://groups.google.com/group/grit-bio ). I'll try to respond promptly to 
bug reports, so please do not hesitate to contact me with questions.