-
Notifications
You must be signed in to change notification settings - Fork 16
/
analysis_steps.md
781 lines (574 loc) · 31.8 KB
/
analysis_steps.md
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
<!---
NOTE:
If you're reading this, instead try opening README.html in a web browser
or view this file from within the github repository website.
This is a github-flavored markdown file not meant to be easily readable.
-->
Analysis steps
==============
- [Optional reference sequence correction](#optional-reference-sequence-correction)
- [Read basecall quality trimming](#initial-basecall-quality-trimming)
- [Paired read merging](#paired-read-merging)
- [Alignment to reference sequences](#alignment-to-reference-sequences)
- [Primer trimming and enforcement of read location requirements](#primer-trimming-and-enforcement-of-read-location-requirements)
- [Ambiguously aligned mutation handling](#ambiguously-aligned-mutation-handling)
- [Multinucleotide mutation handling](#multinucleotide-mutation-handling)
- [Post-alignment basecall quality filtering](#post-alignment-basecall-quality-filtering)
- [Chemical adduct location inference](#chemical-adduct-location-inference)
- [Calculation of mutation rates](#calculation-of-mutation-rates)
- [Reactivity profile calculation and normalization](#reactivity-profile-calculation-and-normalization)
- [Quality control checks](#quality-control-checks)
Steps are listed in order of execution, with the exception of
reference sequence correction, which is a larger preprocessing
stage that includes many of the other listed stages.
---
Optional reference sequence correction
--------------------------------------
If ShapeMapper is given
<kbd>--correct-seq</kbd> and reads from an unmodified sample,
reference sequences will be automatically corrected before any SHAPE analysis.
Specifically, ShapeMapper performs sequence alignment, identifies mutations
occurring with frequencies above 60%, and changes the reference sequence(s)
to match. Adjusted FASTA files are produced in the main output folder, and
the ShapeMapper log will include messages detailing the sequence changes made
and warning if other high but sub-threshold mutations are detected.
This sequence correction is useful for cases in which one major RNA species is
present and contains sequence changes compared to a reference sequence (such as
when studying an RNA virus whose sequence may have drifted from a reference strain).
However, this is insufficient for samples containing a mixture of related but unknown
sequences (such as studying an mRNA in the presence of transcribed pseudogenes of
similar sequence). In such cases, see [FAQ](FAQ.md#multiple-loci).
#### Details
Reads with aligner-reported mapping qualities (SAM MAPQ field) below 10 are excluded
within this stage.
Within this stage, nearby
mutations are _not_ treated as larger multinucleotide
mutations, and ambiguously aligned mutations are _not_ realigned,
since these operations make it more difficult to estimate the frequency of SNPs.
This is different compared to how mutations are handled when estimating relative
SHAPE adduct frequencies - see [Multinucleotide mutation handling](#multinucleotide-mutation-handling)
and [Ambiguously aligned mutation handling](#ambiguously-aligned-mutation-handling).
#### Parameters
<kbd>--min-seq-depth</kbd>   Minimum read depth required to include a mutation at a given position.
<kbd>--min-freq</kbd>   Minimum estimated mutation frequency to include a mutation.
#### Example command line
`shapemapper --name example --target TPP.fa --out --amplicon TPP_mutant --correct-seq --folder TPPminus --modified --folder TPPplus --untreated --folder TPPminus`
---
Initial basecall quality trimming
---------------------------------
Reads are scanned from left to right, and if a window of
<kbd>--window-to-trim</kbd> nucleotides
is found with less than
<kbd>--min-qual-to-trim</kbd>
average basecall quality score (phred score),
that window and any nucleotides to the right are removed from the read. If the remaining
read is shorter than
<kbd>--min-length-to-trim</kbd>
, it is removed
(for file formatting purposes,
it is actually set to a placeholder read containing one 'N',
and later ignored in alignment).
The default trimming settings are intentionally somewhat permissive, and will allow reads through
with some bad basecalls, especially if bad basecalls are isolated (i.e. a single bad
position rather than a run of them). After alignment, a more stringent filtering step will
exclude bad basecalls from contributing to identified mutations and effective read depths.
See [Post-alignment basecall quality filtering](#post-alignment-basecall-quality-filtering).
---
Paired read merging
-------------------
If paired reads are provided as input
(with
<kbd>--folder <folder></kbd>
or with
<kbd>--R1 <r1.fastq> --R2 <r2.fastq></kbd>
),
read pairs are merged using [BBmerge v37.78](https://sourceforge.net/projects/bbmap/) with
the <kbd>vstrict=t</kbd> option. Unmerged reads are *not* thrown out.
Reads after this stage of processing can be optionally output by passing the
<kbd>--output-processed-reads</kbd> option.
---
Alignment to reference sequences
--------------------------------
Read alignment to one or more reference sequences is performed with either
[Bowtie2 v2.3.4.3](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) (default), or
[STAR v2.5.2a](https://github.com/alexdobin/STAR) by passing the
<kbd>--star-aligner</kbd> option. Intermediate read alignment
files (`*.sam`) from this stage can be optionally output by passing the
<kbd>--output-aligned</kbd>
option.
### General considerations
#### Choice of aligner
Both aligners produce nearly identical reactivity profiles for a SHAPE dataset from
extracted _E. coli_ ribosomal RNA, read out by MaP in randomly primed format (see
SFig. 6 in [Busan and Weeks, 2018](http://rnajournal.cshlp.org/content/early/2017/11/07/rna.061945.117)).
Bowtie2 is the best choice for general use. STAR can be a good choice for long or
many reference sequences (such as aligning to viral genomes, lncRNAs, or many mRNAs), and is often
dramatically faster than Bowtie2 for large reference sequences. However, STAR's performance
may suffer if a sizable fraction of input reads do not map to any input reference sequence.
At least with the current parameters used by ShapeMapper, STAR may also be slightly less sensitive
than Bowtie2 (that is, it may fail to align a small fraction of reads that "should have" aligned).
#### Soft clipping
Alignments are performed in local mode (also known as soft-clipped), that is,
attempting to maximize alignment score over a sizable fraction of the read length, but
removing positions on either end that do not agree with the reference sequence. This
helps exclude bad basecalls within the first few nucleotides of sequencing and most adapter sequence.
In practice, soft clipping seems to largely obviate the need for an explicit adapter trimming step (not included
in ShapeMapper). Soft clipping can be disabled in ShapeMapper with the
<kbd>--disable-soft-clipping</kbd>
option, but this is not recommended.
#### Read length
Both STAR and Bowtie2 are short
read aligners, and will not perform well (and will very likely crash) on
extremely long reads, such as those produced by PacBio or Oxford Nanopore
sequencing platforms. This is an area for future work, and will involve
incorporating a long-read aligner into ShapeMapper. If existing long-read
alignments can be coerced into a SAM format, it may be possible to run
`internals/bin/shapemapper_mutation_parser` followed by `internals/bin/shapemapper_mutation_counter`,
but this is untested.
#### Paired read handling
If input reads are unpaired (<kbd>--U</kbd> or <kbd>--unpaired-folder</kbd>),
all reads are aligned in unpaired mode. If input reads are paired
(<kbd>--R1</kbd>|<kbd>--R2</kbd> or <kbd>--folder</kbd>), merged
reads are aligned in unpaired mode, and unmerged reads are aligned
in paired mode.
### Aligner parameters
For reference, the parameters passed to each aligner are explained here. Most
of these parameters are _not_ adjustable by the ShapeMapper user.
#### Bowtie2 parameters
<details>
<kbd>-p</kbd> with the value of ShapeMapper's <kbd>--nproc</kbd> (default=4). <p></p>
<kbd>--local --sensitive-local</kbd> is used, unless the ShapeMapper parameter
<kbd>--disable-soft-clipping</kbd>
is given, in which case
<kbd>--end-to-end --sensitive</kbd> is used.
<kbd>--maxins</kbd> with the value of ShapeMapper's
<kbd>--max-paired-fragment-length</kbd> (default=800).
<kbd>--ignore-quals</kbd>
<kbd>--no-unal</kbd>
<kbd>--mp 3,1</kbd>   Intended to be slightly more consistent with STAR's alignment parameters.
<kbd>--rdg 5,1 --rfg 5,1</kbd>   Reduce penalty for gap extension, since multinucleotide deletions are often a large part of MaP signal.
<kbd>--dpad 30</kbd>   Allow somewhat larger deletions than bowtie2 default.</details>
If <kbd>--max-reseed</kbd> and/or <kbd>--max-search-depth</kbd> is provided, explicit params
"-D <max_search_depth> -R <max_reseed> -N 0 -L <seed_len> -i S,1,0.75" are used instead of
<kbd>--sensitive</kbd> or <kbd>--sensitive-local</kbd>, where
seed_len is 20 by default, or 22 if <kbd>--disable-soft-clipping</kbd> is provided.
<h4><i>Other notes:</i></h4>
Merged and unmerged reads are provided to bowtie2 through a single tab6-formatted input,
which supports mixed paired/unpaired reads.
#### STAR parameters
<details>
<kbd>--runThreadN</kbd> with the value of ShapeMapper's <kbd>--nproc</kbd> parameter (default=4). <p></p>
<kbd>--alignEndsType Local</kbd> is used, unless the ShapeMapper parameter
<kbd>--disable-soft-clipping</kbd>
is given, in which case <kbd>--alignEndsType EndToEnd</kbd> is used.
<kbd>--scoreGap -1000000</kbd>   Disable splice junction detection.
<kbd>--scoreDelBase -1 --scoreInsBase -1</kbd>   Reduce penalty for gap extension, since multinucleotide deletions are often a large part of MaP signal.
<kbd>--outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 999</kbd>   Disable filtering by mismatch counts, since highly modified RNAs will produce some reads with many mutations.
<kbd>--outMultimapperOrder Random --outSAMmultNmax 1</kbd>   Only report one of the top alignments for a read that has an equivalent alignment score at multiple locations.
<kbd>--outSAMattributes MD</kbd>   Include MD tag in SAM output (currently required by ShapeMapper mutation parser).
</details>
<h4><i>Other notes:</i></h4>
STAR does not appear to have an option analogous to bowtie2's
<kbd>--maxins</kbd>, so ShapeMapper
reimplements a check for concordant paired alignment in the mutation parsing stage.
To build a reference sequence index, STAR is run
with <kbd>--genomeSAindexNbases</kbd> and
<kbd>--genomeChrBinNbits</kbd> parameter values calculated as
recommended by the STAR manual.
Some STAR runs with short reference sequences may
nonetheless crash, which may require manually setting the
<kbd>--genomeSAindexNbases</kbd> parameter
to a lower value (this parameter is available through the main ShapeMapper executable).
ShapeMapper can be configured to automatically rerun analyses in which STAR
segfaulted by providing
<kbd>--rerun-on-segfault</kbd>
and
<kbd>--rerun-genomeSAindexNbase</kbd>
(default=3).
Since STAR does not support mixed paired/unpaired input, two separate STAR instances are
run, which may increase memory requirements if especially large reference sequences are used.
ShapeMapper disables STAR's shared-memory index mode by default, but it can be re-enabled by providing
<kbd>--star-shared-index</kbd> (untested).
---
Primer trimming and enforcement of read location requirements
-------------------------------------------------------------
If <kbd>--amplicon</kbd> and/or <kbd>--primers</kbd> with a [primers file](#primers-file) are
provided, [directed primer trimming](#directed-primer-trimming) is performed.
Otherwise, [random primer trimming](#random-primer-trimming) is performed,
using the value of <kbd>--random-primer-len</kbd>.
### Random primer trimming
Mutations overlapping
<tt>(</tt><kbd>--random-primer-len</kbd><tt> + 1)</tt>
nucleotides on the 3′ end
of reads are discarded, and
<tt>(</tt><kbd>--random-primer-len</kbd><tt> + 1)</tt> nucleotides are
also excluded from contributing to effective read depth.
This trimming is not needed for Nextera-style preps, as
the ends of cDNAs are largely digested away before sequencing.
### Directed primer trimming
For amplicon datasets, reads are required to align with their ends within
+/-
<kbd>--max-primer-offset</kbd>
nucleotides of expected PCR primer pairs. This can help to
reduce the effects of off-target reverse transcription and PCR products
(especially shorter products).
Specifically, for unpaired, paired, or merged reads, each read end is required to
align within <kbd>--max-primer-offset</kbd>
of a primer end location, and the nearest primer sites for each end are required to
come from the same primer pair. For paired reads missing a mate pair, only one end
is required to map near a primer (out of forward primers for reads on the forward strand
and out of reverse primers for reads on the reverse strand).
Reads whose aligned locations do not meet requirements are excluded from further
analysis, and the locations of these reads are shown in mapped depth plots as
dashed black lines, labeled "off-target".
For reads meeting requirements, mutations overlapping either primer site are
discarded, and primer sites are excluded from contributing to effective read depth.
#### Usage
For most users, simply set primer binding sites to lowercase sequence in the input
.fa file, and run ShapeMapper with the <kbd>--amplicon</kbd> option. This will handle
cases with a single pair of primers on either end of the target.
For more complex cases with internal PCR primer locations or multiple pairs of primers, run
ShapeMapper with a [primers file](file_formats.md#primers-file)
input with
<kbd>--primers <primers_file></kbd>
. ShapeMapper will automatically determine
the expected locations of input PCR primers by comparing their sequences with those of
reference targets.
### Excluding reads aligning to multiple locations
Reads with aligner-reported MAPQ values below
<kbd>--min-mapq</kbd>
are excluded from further analysis.
The mapped locations of these excluded reads are shown in mapped depth plots
as dotted red lines, annotated "low MAPQ".
If multiple RNA targets share much of their sequence, low MAPQ reads typically lack
enough sequence unique to one particular target. In
other cases, low MAPQ reads are simply short, low complexity reads that spuriously
align to multiple homopolymeric portions of a target sequence.
__Note:__ When using Bowtie2, high numbers of mutations in a read will result in lower
aligner-reported mapping qualities. Therefore, raising the value of <kbd>--min-mapq</kbd>
may have the side effect of excluding highly mutated reads.
### Mapped depth plots
ShapeMapper produces `*_mapped_depths.pdf` files in the main output folder
to visualize the effects of this analysis stage and, with directed primers,
to break down the contributions of each primer pair to overall read depth.
*Example plots:*
![mapped depth example](images/mapped_depth1.png)
Only reads plotted with solid lines are used in further analysis, and only
the shaded portion of each amplicon is included.
![mapped depth example](images/mapped_depth2.png)
Multiple, even overlapping primer pairs are supported. Note in the above plot
that apparent primer dimer or leftover primer reads are excluded -
see for example the unshaded regions of the green rightmost primer pair 13.
---
Post-alignment paired read merging
----------------------------------
For paired reads that fail to merge before alignment, but where both
mates align to the same target for a fragment size <=
<kbd>--max-paired-fragment-length</kbd>
, ShapeMapper merges the mates into a single
read, resolving any sequence or mutation conflicts attempting to favor
higher-quality basecalls.
#### Basecall quality score resolution
At a given position where mate pairs overlap, the higher basecall quality score
is used for the merged read.
#### Mutation resolution
Mutations that overlap are first grouped. For each group of mutations,
the mean phred score over
the set of basecalls within and adjacent to R1 mutations
is compared to the mean phred score over R2 mutations
and adjacent positions. The read mutation group with the higher mean phred score is
then used for the merged read. This strategy avoids rare cases in which
a more naïve mutation resolution algorithm results in a merged read that is
incompatible with both mate pairs.
---
Ambiguously aligned mutation handling
-------------------------------------
Early versions of ShapeMapper entirely excluded ambiguously aligned mutations
from analysis, since they make it impossible to infer adduct location
with full single nucleotide accuracy. However, short ambiguous deletions
are a fairly large signal within many MaP datasets, and we find that
realigning ambiguous mutations 5′ and including them in analysis
produces empirical improvements in reactivity profile accuracy.
See SFig. 2 in [Busan and Weeks, 2018](http://rnajournal.cshlp.org/content/early/2017/11/07/rna.061945.117).
The default behavior (5′ realignment) can be reversed (3′ realignment) by
passing the
<kbd>--right-align-ambig</kbd>
option (not recommended).
*Example realignment:* A five-nucleotide deletion can be equivalently
aligned to two locations. ShapeMapper uses the 5′ location.
3′ aligned read: GAGGAAGGTGGGGATGACGTCA-----TCATGGCCCTTACG
5′ aligned read: GAGGAAGGTGGGGATGAC-----GTCATCATGGCCCTTACG
target: GAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACG
^
inferred adduct site
We hypothesize that most short ambiguous deletions are a result of
reverse transcriptase encountering an adduct, then stalling until a
few nucleotides of the cDNA strand locally melt and reanneal to another
complementary position within the RNA a few nucleotides toward the 5′ end.
Repeat and/or homopolymeric regions still pose difficulties for MaP, both due to
high background reverse transcriptase mutagenesis within these regions and
due to frequent loss of information regarding adduct position. Future
ShapeMapper versions may track and warn the user about such regions.
---
Multinucleotide mutation handling
---------------------------------
We observe that SuperScript II reverse transcription
under MaP conditions on SHAPE-modified RNA produces some reads
with multiple nearby mutations, and hypothesize that many of these
mutation clusters in fact reflect single adducts toward the 3′ end
of a given cluster. ShapeMapper therefore applies a mutation separation
threshold to collapse nearby mutations.
Mutations separated by less than <kbd>--min-mutation-separation</kbd> unchanged
reference sequence nucleotides are collapsed and treated as a single mutation.
This is not applied within sequence correction stages if present (
<kbd>--correct-seq</kbd>
samples).
The default mutation separation threshold (6 nucleotides) was empirically
selected to maximize reactivity profile accuracy for an
extracted _E. coli_ rRNA SHAPE dataset
(see SFig. 1 in [Busan and Weeks, 2018](http://rnajournal.cshlp.org/content/early/2017/11/07/rna.061945.117)).
The same threshold also appears optimal for DMS adducts (unpublished data). Alternative
reverse transcriptases, reverse transcription conditions, or adduct chemistries
may require re-optimization.
_Example:_
![multinuc mutation](images/multinuc_mutations2.png)
---
Post-alignment basecall quality filtering
-----------------------------------------
Initial read quality trimming is fairly permissive to support
high read depths and to permit reads with a few bad basecalls
to be used as long as bad basecalls are isolated (see
[Initial basecall quality trimming](#initial-basecall-quality-trimming)).
After alignment, ShapeMapper applies a separate, more stringent
basecall quality filter to remove mutations and read depths associated
with basecalls with quality scores below
<kbd>--min-qual-to-count</kbd>.
For sequencing runs of mixed quality (such as runs approaching the length
limits of Illumina chemistry), this strategy enables recovery of
high read depths without sacrificing reactivity profile accuracy.
See SFig. 1 in [Busan and Weeks, 2018](http://rnajournal.cshlp.org/content/early/2017/11/07/rna.061945.117).
Specifically,
for each nucleotide position not within mutations (that is, a position covered by a read
but unchanged from reference sequence), a given position is excluded if its basecall,
the basecall immediately 5′, or the basecall immediately 3′ has
a quality score below <kbd>--min-qual</kbd>.
A given mutation is excluded (and the covered sequence region) if it contains
any basecall with quality score below <kbd>--min-qual</kbd>, or if
the basecall immediately 5′ or the basecall immediately 3′ has
a quality score below <kbd>--min-qual</kbd>.
_Example 1:_
![basecall qualities simple](images/basecall_quality1.png)
A poorly scoring portion of the read is excluded from analysis.
_Example 2:_
![basecall qualities example](images/basecall_quality2.png)
Poorly scoring regions of the read are excluded, and one mutation
associated with low-quality basecalls is excluded.
---
Chemical adduct location inference
----------------------------------
ShapeMapper infers the locations of chemical adducts as the reference
sequence position immediately 5′ of the last unchanged reference
nucleotide before a mutation scanning 3′→5′ (as in reverse transcription).
_Examples:_
![adduct locations](images/adduct_inference1.png)
---
Calculation of mutation rates
-----------------------------
### Mutation and read depth counting
After all mutation and read depth processing and filtering
steps are performed, ShapeMapper sums mutation counts and
read depths for each position in each RNA sequence.
### Effective read depth
Simple sequencing read depths are computed as the number of reads crossing
a given position in the target sequence. For computing mutation rates,
however, these depths are somewhat inadequate. The mutation rate at a
given position should be computed as the number of observed mutations
divided by the total number of observations (i.e. the number of
"no mutation" observations plus the number of "mutation" observations).
A given position within the region covered by a multinucleotide mutation is
not actually an observation of "no mutation" - we are effectively blind
over these regions in individual reads. To help account for this, the region over
each mutation is excluded from contributing to the effective read depth,
with the exception of the inferred adduct site (see examples above). This
correction is not applied to sequence correction stages. In practice,
excluding multinucleotide sites from read depths is usually a small change
to a large denominator, and has a negligable effect on reactivity profile
accuracy, at least with mutation rates in the range produced by SHAPE adducts.
Effective read depths as reported by ShapeMapper also exclude low quality
basecalls (see [Post-alignment basecall quality filtering](#post-alignment-basecall-quality-filtering)).
_Optional outputs:_
Intermediate parsed and processed mutations associated
with individual reads can be optionally output by passing the
<kbd>--output-parsed-mutations</kbd> option
(see [file format](file_formats.md#parsed-mutations)).
Tables of mutation counts broken down by mutation classification
can be optionally output by passing the
<kbd>--output-counted-mutations</kbd>
(see [file format](file_formats.md#mutation-counts)).
Tabular histograms of read lengths and per-read mutation counts
can be optionally output in the main ShapeMapper log by passing
the <kbd>--per-read-histograms</kbd>.
### Rate calculation
At each position in a given RNA, mutation rate
for each provided sample is calculated as the mutation count
divided by the effective read depth at that position. In
output tab-delimited `*_profile.txt` files, the relevant
columns are `<sample>_mutations` (mutation counts),
`<sample>_effective_depth` (effective read depths), and
`<sample>_rate` (mutation rate).
_Output plots:_
`*_profiles.pdf` output files include panels for read depths and
mutation rates.
![read depths example](images/read_depths_TPP.png)
Read depths for each sample are plotted with solid lines,
and effective read depths are plotted with
light lines.
![mutation rates example](images/mutation_rates_TPP.png)
Mutation rates plotted with
lines, and standard errors plotted as shaded areas. Mutation rate standard
errors are calculated as `sqrt(mutation rate) / sqrt(effective depth)`.
---
Reactivity profile calculation and normalization
------------------------------------------------
_Sample key:_ `M`: Modified, `U`: Untreated control, `D`: Denatured control
#### Reactivity
If three samples are provided, reactivity is calculated as
   <tt>(rate<sub>M</sub> - rate<sub>U</sub>) / rate<sub>D</sub></tt>
If two samples are provided, reactivity is calculated as
   <tt>rate<sub>M</sub> - rate<sub>U</sub></tt>
If only one sample is provided, reactivity is simply
   <tt>rate<sub>M</sub></tt>
#### Standard error
If three samples are provided, reactivity stderr is calculated as
   <tt>sqrt( stderr<sub>M</sub> / rate<sub>D</sub>)<sup>2</sup> +
(stderr<sub>U</sub> / rate<sub>D</sub>)<sup>2</sup> +
(stderr<sub>D</sub> × (rate<sub>M</sub> - rate<sub>U</sub>) / rate<sub>D</sub><sup>2</sup>)<sup>2</sup> )</tt>
where rate and rate stderr are calculated as [above](#calculation-of-mutation-rates).
If two samples are provided, reactivity stderr is calculated as
   <tt>sqrt(stderr<sub>M</sub><sup>2</sup> + stderr<sub>U</sub><sup>2</sup>)</tt>
If only one sample is provided, reactivity stderr is simply
   <tt>stderr<sub>M</sub></tt>
Reactivity standard errors are shown as error bars in `*_profile.pdf` plots.
#### Excluded positions
Lowercase sequence is excluded from reactivity profiles. Positions
with effective read depth in any sample below
<kbd>--min-depth</kbd>
(default=5000) are
excluded. If an untreated control sample is provided, positions with
untreated mutation rate above
<kbd>--max-bg</kbd>
(default=0.05) are excluded.
#### Normalization
Over the set of all RNAs and nucleotide positions without
masked (lowercase) sequence, high background, or low read depth,
reactivities are normalized by dividing
by the mean reactivity of the top 10% of reactivities after
reactivities above a threshold are excluded
(see section 3.2.1 in [Low and Weeks, 2010](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2941709/)).
That threshold is selected from
the largest value out of
[1.5 × interquartile range,
90th percentile (if total seq length > 100) or 95th percentile (if total seq length < 100)].
By default, ShapeMapper normalizes all profiles together -
that is, using the combined set of reactivities to compute
a single normalization factor that is applied to all profiles.
This can be disabled by passing the <kbd>--indiv-norm</kbd> option.
_Example:_
![reactivity example](images/reactivity_TPP.png)
_Limitations:_
SHAPE-MaP measures relative probe reactivities, and normalization
is an attempt to put these relative values on a more meaningful
scale, closer to what might ideally be achieved with an absolute
measurement of nucleotide conformational state. However, this normalization
relies on the assumption that the distribution of observed reactivities
in a particular experiment matches the distribution of reactivities
observed in experiments used to develop pseudo-free energy structure
modeling constraint parameters. In particular, normalization
will produce systemically incorrect results for highly single-stranded RNAs or
extremely highly base-paired RNAs.
Future work may attempt to fit relative reactivity distributions to empirical
models of the distributions of reactivities observed for paired and unpaired
nucleotides. This would allow more accurate
structure modeling for highly structured or highly unstructured RNAs
as well as providing an estimate of overall base pairing percentage.
---
Quality control checks
----------------------
ShapeMapper performs several basic quality-control checks for each
reactivity profile. These are necessarily heuristic, since different
downstream analyses require different levels of data quality, and
since RNAs of differing flexibility will show different overall
signal levels above background. In general, more sequencing read
depth is always helpful, as are higher modification/mutation rates
(to a point).
If ShapeMapper gives a red warning message about possible low-quality
reactivity profiles, read the log file to see which quality control
checks failed, and refer to this section for possible remedies.
#### Read depth check
ShapeMapper requires that at least 80% of non-masked nucleotides meet a minimum
sequencing depth of 5000 in all provided samples. Note that with
uneven sequencing coverage, some regions of an RNA may have
higher-quality reactivity data than other regions. For example,
coverage is often lower near transcript ends.
If read depths are low, check alignment stats in the logfile to see the
amount of target sequence present in each sample. Better target enrichment or
recovery and/or additional sequencing can often help.
#### Positive mutation rates above background check
If an untreated control sample is provided (highly recommended), ShapeMapper
requires that at least 50% of non-masked nucleotides with depths above
5000 have a higher mutation rate in the modified sample than in the
untreated sample.
#### High background mutation rates check
If an untreated control sample is provided, ShapeMapper requires that no more
than 5% of non-masked nucleotide with depths above 5000 have an
untreated mutation rate above 0.05.
An unusual number of high-background nucleotides can result from the
presence of native modifications or, more often, a subpopulation of
sequence variants or transcribed pseudogenes. There is no
built-in method to automatically determine the sequences of multiple
sequence variants in a mixture, but it can sometimes be helpful to
pass the <kbd>--render-mutations</kbd> option and visually inspect the PDF outputs
from the untreated sample, looking for repeated instances of the same
mutation pattern from distinct reads. By default, these outputs are limited to
100 pages, controlled with the
<kbd>--max-pages</kbd> parameter. The pages are wide enough
by default to accomodate a maximum paired read length of 800 nts, but this is
often very zoomed out; lower the
<kbd>--max-paired-fragment-length</kbd> as needed for
visualization.
#### Number of highly reactive nucleotides check
ShapeMapper requires that at least 8% of non-masked nucleotides with depths
above 5000 have a modified mutation rate above 0.006 after background
subtraction.
Possible causes for failure:
- DNA contamination. Unusually low background mutation rates can
be a secondary indication that this is the problem (since
reverse transcription under MaP conditions usually generates
some errors).
- Poor mixing of chemical reagents and RNA and/or poor reagent
diffusion (if modifying in cells), resulting in low
modification rates
- Expired reagents, resulting in low modification rates
- Poor reverse transcription conditions, resulting in low adduct
read-through
- Extremely highly structured RNA. A molecule that genuinely
contains no flexible nucleotides is indistinguishable from a
highly flexible one that was mistakenly unmodified (they will
both have low mutation rates above background). In this case,
additional control experiments or complementary techniques may
be needed.
[← back to README](../README.md)