forked from nboley/grit
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
355 lines (271 loc) · 16.1 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
GRIT - A tool for the integrative analysis of RNA-seq type assays.
################################################################################
Install:
################################################################################
Build Dependencies:
c compiler
python headers
cython
Runtime Dependencies:
scipy
networkx
pysam
In a debian based system, running
(sudo) apt-get install gcc python-dev cython python-scipy python-networkx
and then
(sudo) easy_install pysam
should install all of the dependencies.
INSTALLATION:
-Method 1: easy_install
Run: (sudo) easy_install GRIT-2.0.1.tar.gz
This should install the dependencies, the grit python module, and the grit
script (run_grit.py). However, installing them through your distributions
package manager is preferred whenever possible.
-Method 2: setup.py
1) unzip the package (tar -zxvf GRIT-VERSION.tar.gz)
2) move into the unzipped directory (cd GRIT-VERSION/)
3) run the setup script ((sudo) python setup.py install)
-Method 3: packages
Making debian, ubuntu, and redhat packages is on my TODO, but I haven't.
If these would be useful, please feel free to send me an email and I'll
move it up the priority list.
################################################################################
Tutorial:
################################################################################
There is a tutorial, with data, available on grit-bio.org.
################################################################################
Introduction:
################################################################################
GRIT is designed to use RNA-seq, TES (e.g. poly(A) seq), and TSS (
e.g. CAGE, RAMPAGE) data to build and quantify full length transcript
models. When all of these data sources are not available, GRIT can be
run by providing a candidate set of TES or TSS sites. In addition, GRIT
can merge in reference junctions, and gene boundaries. GRIT can also
be run in quantification mode, where it uses a provided GTF file and
just estimates transcript expression.
In addition, GRIT only works with stranded and paired RNA-seq data.
Although it could be modified to work with either, we feel that these
data sources are becoming obsolute.
################################################################################
Simple Example:
################################################################################
(to try this, download and unzip
http://grit-bio.org/GRIT_example.tar.gz )
The simplest possible GRIT run is:
run_grit --rnaseq-reads AdMatedF_Ecl_20days_Heads.biorep1.rnaseq.chr4.bam \
--cage-reads AdMatedF_Ecl_20days_Heads.biorep1.cage.chr4.bam \
--polya-reads AdMatedF_Ecl_20days_Heads.biorep2.passeq.chr4.bam \
--reference flybase-r5.45.chr4.gtf
Note that the reference is required to determine the read strands. The following
would produce identical results:
run_grit --rnaseq-reads AdMatedF_Ecl_20days_Heads.biorep1.rnaseq.chr4.bam \
--rnaseq-read-type backward \
--cage-reads AdMatedF_Ecl_20days_Heads.biorep1.cage.chr4.bam \
--cage-read-type backward \
--polya-reads AdMatedF_Ecl_20days_Heads.biorep2.passeq.chr4.bam \
--polya-read-type forward
It is only possible to use a single bam for each data type when using command
line options, and there is no notion of a replicate. However, GRIT also accepts
a control file, which allows for substantial flexibility.
Output:
This will output 3 data files:
discovered.elements.bed
discovered.transcripts.gtf
discovered.expression.csv
which contain the transcript elements (e..g exons and promoters), the
transcripts, and transcript level expression estimates.
Additional options:
Typically, one would run the above with:
--ucsc : formats the .bed and .gtf files so that they can be loaded into the
ucsc genome browser
--threads : specifies to use this many concurrent processes
--fasta : a fasta file, which allows GRIT to run an ORF finder on the
discovered transcripts
In addition, for samples with relatively low read coverage, it can be useful to
include reference elements. Our experience is that junctions are relatively well
annotated, so using --reference with --use-reference-junctions can help to
improve gene connectivity, and because the reference junctions are assigned
a count of zero during quantification, won't bias the expression estimates.
We would not recommend using reference TSS and TES's unless there are no other
options. We have found them to be of much lower quality than other transcript
elements. Instead, looking for publically avbailable CAGE and poly(A)-site-seq
data in a matching tissue or cell line seems to perform better (in human, FANTOM
and Merck poly(A) data are good resources. In fly, modENCODE has all the data
types).
################################################################################
Control File Example:
################################################################################
Instead of the above commands, we could also run:
run_grit.py --control AdMatedF_Ecl_20days_Heads.control.txt
Here's an example control file.
# comment line
# *'s indicate merged
#sample_type rep_id assay paired stranded read_type filename
AdMatedF_20days_Heads rep1 rnaseq true true auto AdMatedF_Ecl_20days_Heads.biorep1.rnaseq.chr4.bam
AdMatedF_20days_Heads rep2 rnaseq true true auto AdMatedF_Ecl_20days_Heads.biorep2.rnaseq.chr4.bam
AdMatedF_20days_Heads * cage false true auto AdMatedF_Ecl_20days_Heads.biorep1.cage.chr4.bam
AdMatedF_20days_Heads * polya false true auto AdMatedF_Ecl_20days_Heads.biorep2.passeq.chr4.bam
sample_type - a unique identifier for the biological sample type.
rep_id - a unique identifier for the replicate
Data from the same sample type is merged for building elements (e.g. junction
discovery, exon building, etc.) but transcripts are quantified on individual
replicates. A '*' as a rep_id indicates that the data should be used for all
replicates for that sample type.
If multiple bam files with the same sample_type and rep_id are provided, they
will be merged. So, for instance, if you have an RNAseq experiment where one
has a read_type of forward, and one has a read_type of backward, then they
should each be provided a line with different read types.
Again, in practice we would probably run:
run_grit --control AdMatedF_Ecl_20days_Heads.control.txt --verbose --ucsc \
-t 16 --reference flybase-r5.45.chr4.gtf --use-refernce-junctions \
--fasta dm_rel_5.fa
################################################################################
Peak Calling Example:
################################################################################
In addition to discovering and building transcript models, GRIT can be used to
identify TSS/TES from RAMPAGE/CAGE/PASseq data and an RNA-seq control. The
ENCODE consortium uses the call_peaks script distributed with GRIT v2.0.4
to identify RAMPAGE peaks. Installation is identical to isntalling the full GRIT
software pacakge, as described above.
To try calling peaks using the ENCODE RNA evaluation data (XXX)
for download a position sorted and indexed RAMPAGE file (eg XXX), a matching
RNAseq sorted and indexed bam file (eg XXX), and a reference annotation in GTF
format (eg XXX). To download test data, in a suitable working directory, run:
wget -r -nd --no-parent -A '*.bam*,*.gtf' \
http://mitra.stanford.edu/kundaje/nboley/grit-bio.org/ENCODE_RAMPAGE_TEST/;
The exact commands for a linux machine with GRIT installed can be found below.
Then run:
call_peaks --rampage-reads $RAMPAGEReads.sorted.bam \
--rnaseq-reads $RNASeqReads.sorted.bam \
--reference $referenceAnnotation.gtf \
--exp-filter-fraction 0.05 --trim-fraction 0.01 \
--ucsc \
--outfname peaks.gff --outfname-type gff \
--bed-peaks-ofname peaks.bed \
--threads $nThreads
Here we describe the individual options used above - the full set of options and
descriptions can be found by running call_peaks --help.
--rampage-reads: a position sorted, indexed bam file containing rampage reads to
identify peaks from.
The ENCODE consoritum uses STAR to map the RAMPAGE reads,
PCR remove duplicates, identify junctions, etc. Details of the
mapping procedure can be found here:
https://github.com/ENCODE-DCC/long-rna-seq-pipeline
If a reference annotation is not provided, then the user must
set the --rampage-read-type option to forward or backward. If
this option is set to forward, then if the first read pair
maps to the reference genome without being reverse complemented
then the read is assumed to have come from a + stranded gene.
--rnaseq-reads: a position sorted, indexed bam file containing rnaseq reads
from the same sample as the --rampage-reads. The RNA-seq reads
help GRIT distinguish noise in a similar way to how a DNA seq
experiment is used to control for a ChIP-seq experiment. GRIT
currently requires RNAseq reads generated using a stranded
protocol.
The ENCODE consoritum uses STAR to map the RNAseq reads;
details of the ENCODE mapping pipeline can be found here:
https://github.com/ENCODE-DCC/long-rna-seq-pipeline
If a reference annotation is not provided, then the user must
set the --rnaseq-read-type option to forward or backward. If
this option is set to forward, then if the first read pair
maps to the reference genome without being reverse complemented
then the read is assumed to have come from a + stranded gene.
--reference: a GTF file containing reference transcripts. GRIT does not
require a reference annotation, but providing one helps
improve the quality of the result by providing connectivity
between transcribed regions. A reference annotation is also
required to automatically infer read strand.
--exp-filter-fraction: peaks with low expression levels *relative to the gene
region in which they lie* will be filtered regardless of their
estimated statistical enrichment. This helps the results to be
robust to differences between the RAMPAGE and RNAseq
experiments. The ENCODE consortium sets this value to 0.05
which means that a peak will be removed if there is another
peak in the same gene with an expression value >= 20x higher.
5% is relatively conservative - those primarily focused on new
discovery that have properly matched RAMPAGE/RNAseq experiments
should feel comfortable lowering this threshold to 1%. If there
is not a control RNASeq experiments from a matching biological
sample then values as high as 25% may be appropriate.
--trim-fraction: Specify how much to trim the edge of peaks. ENCODE uses a value
of 0.01, which specifies that each boundary will be trimmed
while the trimming has removed less than 1% of the total reads
that fall into the peak.
--ucsc: Use contig names compatible with the ucsc browser (this
typically just involves prepending a chr to the contig names)
--outfname: The name of the output file.
--outfname-type: The type of the output file. We recommend using the gff output
whenever possible as future updates will include additional
meta data which will likely break bed backwards compatibility.
--bed-peaks-ofname: Write an additional output file in bed format. This is
primarily used for downstream analysis tools that require bed
input.
--threads: The number of processes to run concurrently.
--verbose: Open a ncurses interface which allows each stage to be
monitored.
## Ouput Formats ###############################################################
gff:
gff output is standard gff output with the following id files:
gene_id: The ID of the GRIT determined gene region this peaks falls in.
gene_name: Currenty identical to gene_id - in the future this will contain
the name of the nearest matching reference gene.
tss_id: A unique identifier for this TSS id.
peak_cov: The RAMPAGE read counts falling within this peak. This is
useful for identifying peak summits, identifying peak
enrichment, identifying binding motifs associated with this
peak, etc.
bed:
The bed output is a standard bed6 with the following additional fields:
read countq : int
gene_id : string
gene_name : string
tss_id : string
peak_coverage : comma separated floats specifying the observed read coverage
across the peak
## Running IDR on peak replicates ##############################################
The ENCODE consortium uses the IDR software package (
https://github.com/nboley/idr/) to identify peaks which are reproducible between
replicates. Given the bed output from two peak calling runs (e.g. PEAK1.bed and
PEAK2.bed) by running:
idr --samples REP1.bed REP2.bed # run IDR on the REP1.bed and REP2.bed
--input-file-type bed --rank 7 # specifies that the input is a bed format,
# and the score track is in column 7
--output-file IDR.peaks.txt # output the peak list and reproducibility
# information to IDR.peaks.txt
--plot # produce a QC plot, IDR.peaks.txt.png
Details about command line options and a description of the output files can be
found at https://github.com/nboley/idr/.
## Example commands for ENCODE test data #######################################
For a linux machine with wget, GRIT and IDR installed, in a suitable working
directory run:
wget -r -nd --no-parent -A '*.bam*,*.gtf' \
http://mitra.stanford.edu/kundaje/nboley/grit-bio.org/ENCODE_RAMPAGE_TEST/;
call_peaks --rampage-reads ENCFF001RDX-ENCFF001RDR_rampage_star_marked.chr20.bam \
--rnaseq-reads ENCFF001RFD-ENCFF001RFC_star_genome.chr20.bam \
--reference gencode.v19.annotation.chr20.gtf \
--exp-filter-fraction 0.05 --trim-fraction 0.01 \
--ucsc \
--outfname peaks.rep1.gff --outfname-type gff \
--bed-peaks-ofname peaks.rep1.bed \
--threads 24;
call_peaks --rampage-reads ENCFF001RET-ENCFF001RES_rampage_star_marked.chr20.bam \
--rnaseq-reads ENCFF001RFF-ENCFF001RFE_star_genome.chr20.bam \
--reference gencode.v19.annotation.chr20.gtf \
--exp-filter-fraction 0.05 --trim-fraction 0.01 \
--ucsc \
--outfname peaks.rep2.gff --outfname-type gff \
--bed-peaks-ofname peaks.rep2.bed \
--threads 24;
idr --samples peaks.rep1.bed peaks.rep2.bed \
--input-file-type bed --rank 7 \
--output-file peaks.IDR.txt --plot;
################################################################################
Command Line Options:
################################################################################
Run 'run_grit --help' for a complete list. They should be well described.
################################################################################
Questions/bug reports:
################################################################################
Please direct questions and bug reports to the mailing list (
http://groups.google.com/group/grit-bio ). I'll try to respond promptly to
bug reports, so please do not hesitate to contact me with questions.