-
Notifications
You must be signed in to change notification settings - Fork 3
/
QuasR.Rmd
1694 lines (1462 loc) · 86.4 KB
/
QuasR.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "An introduction to QuasR"
date: "`r format(Sys.time(), '%d %B, %Y')`"
bibliography: QuasR-refs.bib
author:
- Michael Stadler
- Dimos Gaidatzis
- Charlotte Soneson
- Anita Lerch
package: QuasR
output:
BiocStyle::html_document:
toc_float: true
vignette: >
%\VignetteIndexEntry{An introduction to QuasR}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
---
<!--
\DeclareRobustCommand{\QuasRlogo}{\raisebox{-8mm}{\includegraphics[width=45mm]{QuasR-logo.pdf}}}% use \DeclareRobustCommand instead of \newcommand to avoid premature expansion of \QuasRlogo within by 'titling' package
-->
<img src="QuasR-logo.png" style="position:absolute;top:0px;right:0px;" />
<!--
```{r options, results='hide', echo=FALSE}
#options(width=65)
options('useFancyQuotes' = FALSE, continue = " ", digits = 3)
```
-->
# Introduction
The `r Biocpkg("QuasR")` package (short for <i>Qu</i>antify and <i>a</i>nnotate <i>s</i>hort reads
in <i>R</i>) integrates the functionality of several **R** packages (such as `r Biocpkg("IRanges")` [@IRanges]
and `r Biocpkg("Rsamtools")`) and external software (e.g. `bowtie`, through the
`r Biocpkg("Rbowtie")` package, and `HISAT2`, through the `r Biocpkg("Rhisat2")` package).
The package aims to cover the whole analysis workflow of typical high throughput
sequencing experiments, starting from the raw sequence reads, over pre-processing and
alignment, up to quantification. A single **R** script can contain all steps of a complete
analysis, making it simple to document, reproduce or share the workflow containing all
relevant details.
The current `r Biocpkg("QuasR")` release supports the analysis of single read and
paired-end ChIP-seq (chromatin immuno-precipitation combined with sequencing), RNA-seq
(gene expression profiling by sequencing of RNA) and Bis-seq (measurement of DNA
methylation by sequencing of bisulfite-converted genomic DNA) experiments. It has
been successfully used with data from Illumina, 454 Life Technologies and SOLiD
sequencers, the latter by using bam files created externally of `r Biocpkg("QuasR")`.
# Preliminaries
## Citing `r Biocpkg("QuasR")`
If you use `r Biocpkg("QuasR")` [@QuasR] in your work, you can cite it as follows:
```{r cite, eval=TRUE}
citation("QuasR")
```
## Installation{#installation}
`r Biocpkg("QuasR")` is a package for the **R** computing environment and it is
assumed that you have already installed **R**. See the **R** project at (http://www.r-project.org).
To install the latest version of `r Biocpkg("QuasR")`, you will need to be using the
latest version of **R**. `r Biocpkg("QuasR")` is part of the Bioconductor project
at (http://www.bioconductor.org). To get `r Biocpkg("QuasR")` together with its
dependencies you can use
```{r install, eval=FALSE}
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("QuasR")
```
Bioconductor works on a 6-monthly official release cycle. As with other Bioconductor
packages, there are always two versions of `r Biocpkg("QuasR")`. Most users will
use the current official release version, which will be installed by `BiocManager::install`
if you are using the current release version of **R**. There is also a development version
of `r Biocpkg("QuasR")` that includes new features due for the next official release.
The development version will be installed if you are using the development version of
**Bioconductor** (see `version = "devel"` in `r Biocpkg("BiocManager")`). The official
release version always has an even second number (for example 1.20.1), whereas the
developmental version has an odd second number (for example 1.21.4).
## Loading of *QuasR* and other required packages
In order to run the code examples in this vignette, the `r Biocpkg("QuasR")` package
and a few additional packages need to be loaded:
```{r loadLibraries, eval=TRUE}
suppressPackageStartupMessages({
library(QuasR)
library(BSgenome)
library(Rsamtools)
library(rtracklayer)
library(GenomicFeatures)
library(txdbmaker)
library(Gviz)
})
```
## How to get help
Most questions about `r Biocpkg("QuasR")` will hopefully be answered by the documentation
or references. If you've run into a question which isn't addressed by the documentation,
or you've found a conflict between the documentation and software, then there
is an active support community which can offer help.
The authors of the package (maintainer: `r maintainer("QuasR")`) always appreciate
receiving reports of bugs in the package functions or in the documentation. The same
goes for well-considered suggestions for improvements.
Any other questions or problems concerning `r Biocpkg("QuasR")` should be posted
to the Bioconductor support site (https://support.bioconductor.org). Users posting
to the support site for the first time should read the helpful posting guide at
(https://support.bioconductor.org/info/faq/). Note that each function in `r Biocpkg("QuasR")`
has it's own help page, as described in the section \@ref(introToR).
Posting etiquette requires that you read the relevant help page carefully before
posting a problem to the site.
# Quick Start
## A brief introduction to **R**{#introToR}
If you already use **R** and know about its interface, just skip this
section and continue with section \@ref(sampleQuasRsession).
The structure of this vignette and in particular this section is based on the excellent
user guide of the `r Biocpkg("limma")` package, which we would like to hereby acknowledge.
**R** is a program for statistical computing. It is a command-driven language meaning
that you have to type commands into it rather than pointing and clicking using a mouse.
In this guide it will be assumed that you have successfully downloaded and installed **R**
from (http://www.r-project.org) as well as `r Biocpkg("QuasR")` (see section \@ref(installation)).
A good way to get started is to type
```{r help1, eval=FALSE}
help.start()
```
at the **R** prompt or, if you're using **R** for Windows, to follow the drop-down
menu items *Help* $\succ$ *Html help*. Following the links *Packages* $\succ$ `r Biocpkg("QuasR")`
from the html help page will lead you to the contents page of help topics for functions
in `r Biocpkg("QuasR")`.
Before you can use any `r Biocpkg("QuasR")` commands you have to load the package
by typing
```{r loadQuasRLibrary, eval=FALSE}
library(QuasR)
```
at the **R** prompt. You can get help on any function in any loaded package by typing
`?` and the function name at the **R** prompt, for example
```{r help2, eval=FALSE}
?preprocessReads
```
or equivalently
```{r help3, eval=FALSE}
help("preprocessReads")
```
for detailed help on the `preprocessReads` function. The function help
page is especially useful to learn about its arguments and its return value.
Working with **R** usually means creating and transforming *objects*.
Objects might include data sets, variables, functions, anything at all.
For example
```{r assign, eval=FALSE}
x <- 2
```
will create a variable `x` and will assign it the value 2. At any stage of your
**R** session you can type
```{r ls, eval=FALSE}
ls()
```
to get a list of all the objects currently existing in your session. You can
display an object by typing its name on the prompt. The following displays the
object `x`:
```{r printObject, eval=FALSE}
x
```
We hope that you can use `r Biocpkg("QuasR")` without having to spend a lot of time
learning about the **R** language itself but a little knowledge in this direction
will be very helpful, especially when you want to do something not explicitly provided
for in `r Biocpkg("QuasR")` or in the other Bioconductor packages. For more details
about the **R** language see *An Introduction to R* which is available from the online
help, or one of the many great online resources, like the documentation at [r-project.org](https://www.r-project.org/),
the growing list of free books at [bioconductor.org](https://bioconductor.org/books/release/),
or the books from [rstudio.com](https://rstudio.com/resources/books/) (many of which are
also available for free). For more background on using **R** for statistical analysis see [@Dalgaard].
## Sample *QuasR* session{#sampleQuasRsession}
This is a quick overview of what an analysis could look like for users preferring
to jump right into an analysis. The example uses data that is provided with the
`r Biocpkg("QuasR")` package, which is first copied to the current working directory,
into a subfolder called `"extdata"`:
```{r SampleSession1, eval=TRUE}
file.copy(system.file(package = "QuasR", "extdata"), ".", recursive = TRUE)
```
The sequence files to be analyzed are listed in `sampleFile` (see section
\@ref(sampleFile) for details).
The sequence reads will be aligned using `bowtie` [@bowtie] (from the `r Biocpkg("Rbowtie")`
package [@Rbowtie]) to a small reference genome (consisting of three short segments
from the hg19 human genome assembly, available in full for example in the
`r Biocpkg("BSgenome.Hsapiens.UCSC.hg19")` package). Information on selecting
an appropriate reference genome is summarized in section \@ref(genome).
Make sure that you have sufficient disk space, both in your **R** temporary
directory (`tempdir()`) as well as to store the resulting alignments
(see section \@ref(qAlign)).
```{r SampleSession2, eval=TRUE}
sampleFile <- "extdata/samples_chip_single.txt"
genomeFile <- "extdata/hg19sub.fa"
proj <- qAlign(sampleFile, genomeFile)
proj
```
The `proj` object keeps track of all the information of a sequencing experiment,
for example where sequence and alignment files are stored, and what aligner and
reference genome was used to generate the alignments.
Now that the alignments have been generated, further analyses can be performed.
A quality control report is saved to the `"extdata/qc_report.pdf"` file using the
`qQCReport` function.
```{r SampleSession3, eval=TRUE}
qQCReport(proj, "extdata/qc_report.pdf")
```
The number of alignments per promoter region is quantified using `qCount`. Genomic
coordinates for promoter regions are imported from a gtf file (`annotFile`) into
the `GRanges`-object with the name `promReg`:
```{r SampleSession4, eval=TRUE}
library(rtracklayer)
library(GenomicFeatures)
annotFile <- "extdata/hg19sub_annotation.gtf"
txStart <- import.gff(annotFile, format = "gtf", feature.type = "start_codon")
promReg <- promoters(txStart, upstream = 500, downstream = 500)
names(promReg) <- mcols(promReg)$transcript_name
promCounts <- qCount(proj, query = promReg)
promCounts
```
# *QuasR* Overview
The following scheme shows the major components of `r Biocpkg("QuasR")` and their
relationships:
![](QuasR-scheme.png)
`r Biocpkg("QuasR")` works with data (sequences and alignments, reference genome,
etc.) that are stored as files on your storage (the gray cylinder on the lower left
of Figure above, see section \@ref(fileStorageLocations) for details on storage locations).
`r Biocpkg("QuasR")` does not need a database management system, or these files to
be named and organized according to a specific scheme.
In order to keep track of directory paths during an analysis, `r Biocpkg("QuasR")`
makes use of a `qProject` object that is returned by the `qAlign` function, which
at the minimum requires two inputs: the name of a samples text file (see section
\@ref(sampleFile) for details), and the reference genome for the alignments
(see section \@ref(genome)).
The `qProject` object is the main argument passed to subsequent functions such as
`qQCReport` and `qCount`. The `qProject` object contains all necessary information
on the current project and eliminates the need to repeatedly enter the same information.
All functions that work on `qProject` objects can be recognized by their names starting
with the letter *q*.
Read quantification (apart from quantification of methylation which has its own
function `qMeth`) is done using the `qCount` function: It counts the alignments in
regions of interest (e.g. promoters, genes, exons, etc.) and produces a count table
(regions in rows, samples in columns) for further visualization and analysis. The
count table can also be used as input to a statistical analysis using packages such
as `r Biocpkg("edgeR")` [@edgeR], `r Biocpkg("DESeq")` [@DESeq], `r Biocpkg("DESeq2")` [@DESeq2],
`r Biocpkg("TCC")` [@TCC], `r Biocpkg("DEXSeq")` [@DEXSeq] or `r Biocpkg("baySeq")` [@baySeq].
In summary, a typical `r Biocpkg("QuasR")` analysis consists of the following steps
(some of them are optional):
- `preprocessReads` (optional): Remove adapters from start or end of reads, filter
out reads of low quality, short length or low complexity (section \@ref(preProcessing)).
- Prepare *samples file*: List sequence files or alignments, provide sample names
(section \@ref(sampleFile)).
- Prepare *auxiliary file* (optional): List additional reference sequences for
alignment of reads not matching the reference genome (section \@ref(auxiliaryFile)).
- `qAlign`: Create `qProject` object and specify project parameters. Also download
BSgenome package, create aligner indices and align reads if not already existing
(section \@ref(qAlign)).
- `qQCReport` (optional): Create quality control report with plots on sequence qualities
and alignment statistics (section \@ref(qQCReport)).
- `qExportWig` (optional): Export genomic alignments as wiggle tracks for genome
browser visualization (section \@ref(qExportWig)).
- `qCount`: Quantify alignments in regions of interest (section \@ref(qCount)).
Recurrent example tasks that may be part of any typical analysis are described in
section \@ref(exampleTasks). Example workflows for specific experiment types (ChIP-seq,
RNA-seq and Bis-seq) are described in section \@ref(exampleWorkflows).
## File storage locations{#fileStorageLocations}
Apart from `qExportWig` and `qQCReport`, which generate wig files and pdf reports,
`qAlign` is the only function in `r Biocpkg("QuasR")` that stores files on the disk
(see section \@ref(qAlign) for details). All files generated by `qAlign` are listed here
by type, together with their default location and how locations can be changed.
- *Temporary files* (default: `tempdir()`): Temporary files include reference genomes
in `fasta` format, decompressed input sequence files, and temporary alignments in
text format, and can require a large amount of disk space. By default, these files
will be written to the temporary directory of the **R** process (as reported by
`tempdir()`). If using `clObj` for parallel processing, this may be the `tempdir()`
from the cluster node(s). An alternative location can be set using the `TMPDIR`
environment variable (see `?tempdir`).
- *Alignment files (`bam` format)* (default: same directory as the input sequence files):
Alignments against reference genome and auxiliary targets are stored in `bam` format
in the same directory that also contains the input sequence file (listed in `sampleFile`).
Please note that if the input sequence file corresponds to a symbolic link, `r Biocpkg("QuasR")`
will follow the link and use the directory of the original file instead. An alternative
directory can be specified with the `alignmentsDir` argument from `qAlign`, which
will store all `bam` files in that directory even if the input sequence files are located
in different directories.
- *Alignment index files* (default: depends on `genome` and `snpFile` arguments):
Many alignment tools including `bowtie` require an index of the reference sequence
to perform alignments. If necessary, `qAlign` will build this index automatically
and store it in a default location that depends on the `genome` argument:
+ `BSgenome`: If `genome` is the name of a `r Biocpkg("BSgenome")` package
(such as `"BSgenome.Hsapiens.UCSC.hg19"`), the index will be stored as a
new **R** package in the default library path (as reported by `.libPaths()[1]`,
see `?install.packages` for details), or alternatively in the library
specified by the `lib.loc` argument of `qAlign`. The name of this index
package will be the name of the original `r Biocpkg("BSgenome")` package
with a suffix for the index type, for example `"BSgenome.Hsapiens.UCSC.hg19.Rbowtie"`.
+ `fasta`: If `genome` refers to a reference genome file in `fasta` format,
the index will be stored in a subdirectory at the same location. Similarly,
the indices for files listed in `auxiliaryFile` are store at the location
of these files. For example, the `Rbowtie` index for the genome at
`"./genome/mm9.fa"` is stored in `"./genome/mm9.fa.Rbowtie"`.
+ *Allele-specific analysis*: A special case is the allele-specific analysis,
where reference and alternative alleles listed in `snpFile` (e.g. `"./mySNPs.tab"`)
are injected into the `genome` (e.g. `"BSgenome.Mmusculus.UCSC.mm9"`)
to create two variant genomes to be indexed. These indices are saved at the
location of the `snpFile` in a directory named after `snpFile`, `genome`
and the index type (e.g. `"./mySNPs.tab.BSgenome.Mmusculus.UCSC.mm9.A.fa.Rbowtie"`).
# Example tasks{#exampleTasks}
## Create a sample file{#sampleFile}
The sample file is a tab-delimited text file with two or three columns. The first
row contains the column names: For a single read experiment, these are 'FileName'
and 'SampleName'; for a paired-end experiment, these are 'FileName1', 'FileName2'
and 'SampleName'. If the first row does not contain the correctly spelled column
names, `r Biocpkg("QuasR")` will not accept the samples file. Subsequent rows contain
the input sequence files.
Here are examples of such sample files for a single read experiment:
<pre>
```{r sampleFileSingle, echo=FALSE, results="asis"}
cat(paste(readLines(system.file(package = "QuasR",
"extdata", "samples_chip_single.txt")),
collapse = "\n"))
```
</pre>
<!--
------------------- | -------------
`FileName` | `SampleName`
`chip_1_1.fq.bz2` | `Sample1`
`chip_2_1.fq.bz2` | `Sample2`
-->
and for a paired-end experiment:
<pre>
```{r sampleFilePaired, echo=FALSE, results="asis"}
cat(paste(readLines(system.file(package = "QuasR",
"extdata", "samples_rna_paired.txt")),
collapse = "\n"))
```
</pre>
<!--
`FileName1` | `FileName2` | `SampleName`
`rna_1_1.fq.bz2` | `rna_1_2.fq.bz2` | `Sample1`
`rna_2_1.fq.bz2` | `rna_2_2.fq.bz2` | `Sample2`
-->
These example files are also contained in the `r Biocpkg("QuasR")` package and may
be used as templates. The path of the files can be determined using:
```{r sampleFile, eval=FALSE}
sampleFile1 <- system.file(package="QuasR", "extdata",
"samples_chip_single.txt")
sampleFile2 <- system.file(package="QuasR", "extdata",
"samples_rna_paired.txt")
```
The columns *FileName* for single-read, or *FileName1* and *FileName2* for paired-end
experiments contain paths and names to files containing the sequence data. The paths
can be absolute or relative to the location of the sample file. This allows combining
files from different directories in a single analysis. For each input sequence file,
`qAlign` will create one alignment file and by default store it in the same directory
as the sequence file. Already existing alignment files with identical parameters will
not be re-created, so that it is easy to reuse the same sequence files in multiple
projects without unnecessarily copying sequence files or recreating alignments.
The *SampleName* column contains sample names for each sequence file. The same name
can be used on several lines to indicate multiple sequence files that belong to the
same sample (`qCount` can use this information to automatically combine counts for
one sample from multiple files).
Three file formats are supported for input files (but cannot be mixed within a single
sample file):
- **fasta** files have names that end with '.fa', '.fna' or '.fasta'. They contain
only sequences (and no base qualities) and will thus by default be aligned on the
basis of mismatches (the best alignment is the one with fewest mismatches).
- **fastq** files have names that end with '.fq' or '.fastq'. They contain sequences
and corresponding base qualities and will be aligned by default using these qualities.
The encoding scheme of base qualities is automatically detected for each individual
fastq file.
- **bam** files have names that end with '.bam'. They can be used if the sequence
reads have already been aligned outside of `r Biocpkg("QuasR")`, and `r Biocpkg("QuasR")`
will only be used for downstream analysis based on the alignments contained in the
`bam` files. This makes it possible to use alignment tools that are not available
within `r Biocpkg("QuasR")`, but making use of this option comes with a risk and
should only be used by experienced users. For example, it cannot be guaranteed
any more that certain assumptions made by `qCount` are fulfilled by the external
aligner (see below). When using external `bam`
files, we recommend to use files which contain only one alignment per read. This
may also include multi-hit reads, for which one of the alignments is randomly
selected. This allows `r Biocpkg("QuasR")` to count the total number of reads by
counting the total number of alignments. Furthermore, if the `bam` files also
contain the unmapped reads, `r Biocpkg("QuasR")` will be able to calculate the
fraction of mapped reads. For bisulfite samples we require ungapped alignments
stored in unpaired or paired *ff* orientation (even if the input reads are *fr*).
For allele-specific `bam` files, `r Biocpkg("QuasR")` requires an additional tag
for each alignment called `XV` of type `A` (printable character) with the possible
values `R` (Reference), `U` (Unknown) and `A` (Alternative).
**fasta** and **fastq** files can be compressed with gzip, bzip2 or xz (file extensions
'.gz', '.bz2' or 'xz', respectively) and will be automatically decompressed when necessary.
### Working only with `bam` files after performing alignments
Once alignments have been created, most analyses will only require the `bam` files
and will not access the original raw sequence files anymore. However, re-creating
a `qProject` object by a later identical call to `qAlign` will still need access to
the raw sequences to verify consistency between raw data and alignments. It may be
desirable to remove this dependency, for example to archive or move away the raw
sequence files and to reclaim used disk space.
This can be achieved using the following procedure involving two sequential calls
to `qAlign`. First, `qAlign` is called with the orignial sample file (`sampleFile1`)
that lists the raw sequence files, and subsequently with a second sample file
(`sampleFile2`) that lists the `bam` files generated in the first call. Such a
second sample file can be easily generated given the `qProject` object (`proj1`)
returned by the first call:
```{r <sampleFileSeqToBam, eval=FALSE}
sampleFile1 <- "samples_fastq.txt"
sampleFile2 <- "samples_bam.txt"
proj1 <- qAlign(sampleFile1, genomeFile)
write.table(alignments(proj1)$genome, sampleFile2, sep="\t", row.names=FALSE)
proj2 <- qAlign(sampleFile2, genomeFile)
```
The analysis can now be exclusively based on the `bam` files using `sampleFile2` and `proj2`.
### Consistency of samples within a project
The sample file implicitly defines the type of samples contained in the project:
*single read* or *paired-end read*, sequences *with* or *without* qualities.
This type will have a profound impact on the downstream analysis. For example,
it controls whether alignments will be performed in single or paired-end mode,
either with or without base qualities. That will also determine availability of
certain options for quality control and quantification in `qQCReport` and `qCount`.
For consistency, it is therefore required that all samples within a project have
the same type; it is not possible to mix both single and paired-end read samples,
or **fasta** and **fastq** files in a single project (sample file). If necessary,
it may be possible to analyse different types of files in separate `r Biocpkg("QuasR")`
projects and combine the derived results at the end.
## Create an auxiliary file (optional){#auxiliaryFile}
By default `r Biocpkg("QuasR")` aligns reads only to the reference genome. However,
it may be interesting to align non-matching reads to further targets, for example
to identify contamination from vectors or a different species, or in order to
quantify spike-in material not contained in the reference genome. In `r Biocpkg("QuasR")`,
such supplementary reference files are called *auxiliary* references and can be
specified to `qAlign` using the `auxiliaryFile` argument (see section \@ref(qAlign)
for details). The format of the auxiliary file is similar to the one of the sample
file described in section \@ref(sampleFile): It contains two columns with
column names 'FileName' and 'AuxName' in the first row. Additional rows contain
names and files of one or several auxiliary references in `fasta` format.
An example auxiliary file looks like this:
<pre>
```{r auxFile, echo=FALSE, results="asis"}
cat(paste(readLines(system.file(package = "QuasR",
"extdata", "auxiliaries.txt")),
collapse = "\n"))
```
</pre>
<!--
`FileName` | `AuxName`
`NC_001422.1.fa` | `phiX174`
-->
and is available from your `r Biocpkg("QuasR")` installation at
```{r auxiliaryFile, eval=TRUE}
auxFile <- system.file(package = "QuasR", "extdata", "auxiliaries.txt")
```
## Select the reference genome{#genome}
Sequence reads are primarily aligned against the reference genome (see
section \@ref(redundancy) on how to choose a suitable reference assembly).
If necessary, `r Biocpkg("QuasR")` will create an aligner index for the genome.
The reference genome can be provided in one of two different formats:
- **a string**, referring to the name of a `r Biocpkg("BSgenome")` package:
```{r selectGenomeBSgenome, eval=TRUE}
available.genomes()
genomeName <- "BSgenome.Hsapiens.UCSC.hg19"
```
In this example, the BSgenome package `"BSgenome.Hsapiens.UCSC.hg19"` refers to
an unmasked genome; alignment index and alignments will be performed on the full
unmasked genome sequence (recommended). If using a masked genome (e.g. `"BSgenome.Hsapiens.UCSC.hg19.masked"`),
masked regions will be replaced with `"N"` bases, and this hard-masked version of
the genome will be used for creating the alignment index and further alignments.
Please also see section \@ref(redundancy) for potential issues with redundant
sequences contained in the reference genome, e.g. in `r Biocpkg("BSgenome.Hsapiens.UCSC.hg19")`
or `r Biocpkg("BSgenome.Hsapiens.UCSC.hg38")`.
- **a file name**, referring to a sequence file containing one or several reference
sequences (e.g. chromosomes) in `fasta` format:
```{r selectGenomeFile, eval=FALSE}
genomeFile <- system.file(package="QuasR", "extdata", "hg19sub.fa")
```
### Choosing a suitable (non-redundant) reference genome{#redundancy}
For some organisms, several versions of the genome assembly exist. These
differ for example in whether or not they include alternative variants for
sequences that are variable within the species. This may lead to
redundant sequences in the assembly, and thus reads mapping to
such sequences being wrongly classified as "multi-mapping", and comparing
data aligned to different assembly version may give rise to incorrect
results. A nice summary of this issue is provided in this [blog post](https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use) from
Heng Li.
The `r Biocpkg("BSgenome")` packages `r Biocpkg("BSgenome.Hsapiens.UCSC.hg19")` (versions
newer than 1.4.0 from Bioconductor 3.10) and `r Biocpkg("BSgenome.Hsapiens.UCSC.hg38")`
(all versions) do contain such redundant sequences and are therefore not ideal
references for alignment of human data. Specific "analysis set" or "primary assembly"
versions of the assembly should be used instead (see the before-mentioned [blog post](https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use) for
details).
When using a `r Biocpkg("BSgenome")` reference, `r Biocpkg("QuasR")` will check
in the `qAlign` function whether the chromosome names and lengths contained
in the header of any pre-existing bam files are identical to the ones provided by
the genome and warn if this is not the case.
## Sequence data pre-processing {#preProcessing}
The `preprocessReads` function can be used to prepare the input sequence files
prior to alignment. The function takes one or several sequence files (or pairs
of files for a paired-end experiment) in `fasta` or `fastq` format as input and
produces the same number of output files with the processed reads.
In the following example, we truncate the reads by removing the three bases from
the 3'-end (the right side), remove the adapter sequence `AAAAAAAAAA` from the
5'-end (the left side) and filter out reads that, after truncation and adapter
removal, are shorter than 14 bases or contain more than 2 `N` bases:
```{r preprocessReadsSingle,eval=TRUE}
td <- tempdir()
infiles <- system.file(package = "QuasR", "extdata",
c("rna_1_1.fq.bz2","rna_2_1.fq.bz2"))
outfiles <- file.path(td, basename(infiles))
res <- preprocessReads(filename = infiles,
outputFilename = outfiles,
truncateEndBases = 3,
Lpattern = "AAAAAAAAAA",
minLength = 14,
nBases = 2)
res
unlink(outfiles)
```
`preprocessReads` returns a matrix with a summary of the pre-processing. The matrix
contains one column per (pair of) input sequence files, and contains the total
number of reads (`totalSequences`), the number of reads that matched to the five
prime or three prime adapters (`matchTo5pAdapter` and `matchTo3pAdapter`), the
number of reads that were too short (`tooShort`), contained too many non-base
characters (`tooManyN`) or were of low sequence complexity (`lowComplexity`,
deactivated by default). Finally, the number of reads that passed the filtering
steps is reported in the last row (`totalPassed`).
In the example below we process paired-end reads, removing all pairs with one or
several `N` bases. Even if only one sequence in a pair fulfills the filtering
criteria, both reads in the pair are removed, thereby preserving the matching
order of the sequences in the two files:
```{r preprocessReadsPaired,eval=TRUE}
td <- tempdir()
infiles1 <- system.file(package = "QuasR", "extdata", "rna_1_1.fq.bz2")
infiles2 <- system.file(package = "QuasR", "extdata", "rna_1_2.fq.bz2")
outfiles1 <- file.path(td, basename(infiles1))
outfiles2 <- file.path(td, basename(infiles2))
res <- preprocessReads(filename = infiles1,
filenameMate = infiles2,
outputFilename = outfiles1,
outputFilenameMate = outfiles2,
nBases = 0)
res
```
More details on the `preprocessReads` function can be found in the function
documentation (see `?preprocessReads`) or in the section \@ref(preprocessReads).
# Example workflows{#exampleWorkflows}
## ChIP-seq: Measuring protein-DNA binding and chromatin modifications{#ChIPseq}
Here we show an exemplary single-end ChIP-seq workflow using a small number of
reads from a histone 3 lysine 4 trimethyl (H3K4me3) ChIP-seq experiment. This
histone modification is known to locate to genomic regions with a high density
of CpG dinucleotides (so called CpG islands); about 60% of mammalian genes have
such a CpG island close to their transcript start site. All necessary files are
included in the `r Biocpkg("QuasR")` package, and we start the example workflow
by copying those files into the current working directly, into a subfolder called `"extdata"`:
```{r ChIP_copyExtdata, eval=TRUE}
file.copy(system.file(package = "QuasR", "extdata"), ".", recursive = TRUE)
```
### Align reads using the `qAlign` function
We assume that the sequence reads have already been pre-processed as described
in section \@ref(preProcessing). Also, a sample file (section \@ref(sampleFile))
that lists all sequence files to be analyzed has been prepared. A `fasta` file
with the reference genome sequence(s) is also available (section \@ref(genome)),
as well as an auxiliary file for alignment of reads that failed to match the reference
genome (section \@ref(auxiliaryFile)).
By default, newly generated `bam` files will be stored at the location of the input
sequence files, which should be writable and have sufficient capacity (an alternative
location can be specified using the `alignmentsDir` argument). Make also sure that
you have sufficient temporary disk space for intermediate files in `tempdir()`
(see section \@ref(qAlign)). We start by aligning the reads using `qAlign`:
```{r ChIP_qAlign, eval=TRUE}
sampleFile <- "extdata/samples_chip_single.txt"
auxFile <- "extdata/auxiliaries.txt"
genomeFile <- "extdata/hg19sub.fa"
proj1 <- qAlign(sampleFile, genome = genomeFile, auxiliaryFile = auxFile)
proj1
```
`qAlign` will build alignment indices if they do not yet exist (by default, if the
genome and auxiliary sequences are given in the form of `fasta` files, they will
be stored in the same folder). The `qProject` object (`proj1`) returned by `qAlign`
now contains all information about the ChIP-seq experiment: the (optional) project name,
the project options, aligner package, reference genome, and at the bottom the sequence
and alignment files. For each input sequence file, there will be one `bam` file with
alignments against the reference genome, and one for each auxiliary target sequence
with alignments of reads without genome hits. Our `auxFile` contains a single auxiliary
target sequence, so we expect two `bam` files per input sequence file:
```{r ChIP_bamfiles1, eval=TRUE}
list.files("extdata", pattern = ".bam$")
```
The `bam` file names consist of the base name of the sequence file with an added
random string. The random suffix makes sure that newly generated alignment files
do not overwrite existing ones, for example of the same reads aligned against an
alternative reference genome. Each alignment file is accompanied by two additional
files with suffixes `.bai` and `.txt`:
```{r ChIP_bamfiles2, eval=TRUE}
list.files("extdata", pattern = "^chip_1_1_")[1:3]
```
The `.bai` file is the `bam` index used for fast access by genomic coordinate.
The `.txt` file contains all the parameters used to generate the corresponding `bam`
file. Before new alignments are generated, `qAlign` will look for available `.txt`
files in default locations (the directory containing the input sequence file, or
the value of `alignmentsDir`), and read their contents to determine if a compatible
`bam` file already exists. A compatible `bam` file is one with the same reads and
genome, aligned using the same aligner and identical alignment parameters. If a
compatible `bam` file is not found, or the `.txt` file is missing, `qAlign` will
generate a new `bam` file. It is therefore recommended not to delete the `.txt`
file - without it, the corresponding `bam` file will become unusable for `r Biocpkg("QuasR")`.
### Create a quality control report
`r Biocpkg("QuasR")` can produce a quality control report in the form of a series
of diagnostic plots with details on sequences and alignments (see QuasR scheme figure above).
The plots are generated by calling the `qQCReport` function with the `qProject`
object as argument. `qQCReport` uses `r Biocpkg("ShortRead")` [@ShortRead] internally
to obtain some of the quality metrics, and some of the plots are inspired by the
FastQC quality control tool by Simon Andrews (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/).
The plots will be stored into a multipage PDF document defined by the `pdfFilename`
argument, or else shown as individual plot windows on the current graphics device.
In order to keep the running time reasonably short, some quality metrics are obtained
from a random sub-sample of the sequences or alignments.
```{r ChIP_qcplot1, eval=TRUE, echo=FALSE}
qcdat1 <- qQCReport(proj1, pdfFilename = "extdata/qc_report.pdf")
```
```{r ChIP_qcplots2, eval=TRUE}
qQCReport(proj1, pdfFilename = "extdata/qc_report.pdf")
```
Currently available plots are described in section \@ref(qQCReport) and following.
### Alignment statistics
The `alignmentStats` gets the number of (un-)mapped reads for each sequence file
in a `qProject` object, by reading the `bam` file indices, and returns them as a
`data.frame`. The function also works for arguments of type `character` with one
or several `bam` file names (for details see section \@ref(alignmentStats)).
```{r ChIP_alignmentStats, eval=TRUE}
alignmentStats(proj1)
```
### Export genome wig file from alignments
For visualization in a genome browser, alignment coverage along the genome can be
exported to a (compressed) wig file using the `qExportWig` function. The created
fixedStep wig file (see (http://genome.ucsc.edu/goldenPath/help/wiggle.html) for
details on the wig format) will contain one track per sample in the `qProject`
object. The resolution is defined using the `binsize` argument, and if `scaling`
is set to `TRUE`, read counts per bin are scaled by the total number of aligned
reads in each sample to improve comparability:
```{r ChIP_qExportWig, eval=TRUE}
qExportWig(proj1, binsize = 100L, scaling = TRUE, collapseBySample = TRUE)
```
### Count alignments using `qCount`
Alignments are quantified using `qCount`, for example using a `GRanges` object as
a query. In our H3K4me3 ChIP-seq example, we expect the reads to occur around the
transcript start site of genes. We can therefore construct suitable query regions
using genomic intervals around the start sites of known genes. In the code below,
this is achieved with help from the `r Biocpkg("txdbmaker")` package to first
create a `TxDb` object from a `.gtf` file with gene annotation. With the `promoters`
function from the `r Biocpkg("GenomicFeatures")` package, we can then create the
`GRanges` object with regions to be quantified. Finally, because most genes
consist of multiple overlapping transcripts, we select the first transcript for
each gene:
```{r ChIP_GenomicFeatures, eval=TRUE}
library(txdbmaker)
annotFile <- "extdata/hg19sub_annotation.gtf"
chrLen <- scanFaIndex(genomeFile)
chrominfo <- data.frame(chrom = as.character(seqnames(chrLen)),
length = width(chrLen),
is_circular = rep(FALSE, length(chrLen)))
txdb <- makeTxDbFromGFF(file = annotFile, format = "gtf",
chrominfo = chrominfo,
dataSource = "Ensembl",
organism = "Homo sapiens")
promReg <- promoters(txdb, upstream = 1000, downstream = 500,
columns = c("gene_id","tx_id"))
gnId <- vapply(mcols(promReg)$gene_id,
FUN = paste, FUN.VALUE = "",
collapse = ",")
promRegSel <- promReg[ match(unique(gnId), gnId) ]
names(promRegSel) <- unique(gnId)
promRegSel
```
Using `promRegSel` object as query, we can now count the alignment per sample in
each of the promoter windows.
```{r ChIP_qCount, eval=TRUE}
cnt <- qCount(proj1, promRegSel)
cnt
```
The counts returned by `qCount` are the raw number of alignments per sample and
region, without any normalization for the query region length, or the total number
of aligned reads in a sample. As expected, we can find H3K4me3 signal at promoters
of a subset of the genes with CpG island promoters, which we can visualize with help
of the `r Biocpkg("Gviz")` package:
```{r ChIP_visualize, eval=TRUE, fig.width=8, fig.height=4.5}
gr1 <- import("Sample1.wig.gz")
gr2 <- import("Sample2.wig.gz")
library(Gviz)
axisTrack <- GenomeAxisTrack()
dTrack1 <- DataTrack(range = gr1, name = "Sample 1", type = "h")
dTrack2 <- DataTrack(range = gr2, name = "Sample 2", type = "h")
txTrack <- GeneRegionTrack(txdb, name = "Transcripts", showId = TRUE)
plotTracks(list(axisTrack, dTrack1, dTrack2, txTrack),
chromosome = "chr3", extend.left = 1000)
```
### Create a genomic profile for a set of regions using `qProfile`
Given a set of anchor positions in the genome, `qProfile` calculates the number of
nearby alignments relative to the anchor position, for example to generate a average
profile. The neighborhood around anchor positions can be specified by the `upstream`
and `downstream` argument. Alignments that are upstream of an anchor position will
have a negative relative position, and downstream alignments a positive. The anchor
positions are all aligned at position zero in the return value.
Anchor positions will be provided to `qProfile` using the `query` argument, which
takes a `GRanges` object. The anchor positions correspond to `start()` for regions
on `+` or `*` strands, and to `end()` for regions on the `-` strand. As mentioned
above, we expect H3K4me3 ChIP-seq alignments to be enriched around the transcript
start site of genes. We can therefore construct a suitable `query` object from the
start sites of known genes. In the code below, start sites (`start_codon`) are imported
from a `.gtf` file with the help of the `r Biocpkg("rtracklayer")` package. In addition,
`strand` and `gene_name` are also selected for import. Duplicated start sites, e.g. from
genes with multiple transcripts, are removed. Finally, all regions are given the name
`TSS`, because `qProfile` combines regions with identical names into a single profile.
```{r ChIP_rtracklayer, eval=TRUE}
library(rtracklayer)
annotationFile <- "extdata/hg19sub_annotation.gtf"
tssRegions <- import.gff(annotationFile, format = "gtf",
feature.type = "start_codon",
colnames = "gene_name")
tssRegions <- tssRegions[!duplicated(tssRegions)]
names(tssRegions) <- rep("TSS", length(tssRegions))
head(tssRegions)
```
Alignments around the `tssRegions` coordinates are counted in a window defined by
the `upstream` and `downstream` arguments, which specify the number of bases to
include around each anchor position. For `query` regions on `+` or `*` strands,
upstream refers to the left side of the anchor position (lower coordinates),
while for regions on the `-` strand, upstream refers to the right side (higher coordinates).
The following example creates separate profiles for alignments on the *same* and
on the *opposite* strand of the regions in `query`.
```{r ChIP_qProfile, eval=TRUE}
prS <- qProfile(proj1, tssRegions, upstream = 3000, downstream = 3000,
orientation = "same")
prO <- qProfile(proj1, tssRegions, upstream = 3000, downstream = 3000,
orientation = "opposite")
lapply(prS, "[", , 1:10)
```
The counts returned by `qProfile` are the raw number of alignments per sample and
position, without any normalization for the number of query regions or the total
number of alignments in a sample per position. To obtain the average number of alignments,
we divide the alignment counts by the number of `query` regions that covered a given
relative position around the anchor sites. This coverage is available as the first
element in the return value. The shift between *same* and *opposite* strand alignments
is indicative for the average length of the sequenced ChIP fragments.
```{r ChIP_visualizeProfile, eval=TRUE, fig.width=8, fig.height=4.5}
prCombS <- do.call("+", prS[-1]) / prS[[1]]
prCombO <- do.call("+", prO[-1]) / prO[[1]]
plot(as.numeric(colnames(prCombS)), filter(prCombS[1,], rep(1/100,100)),
type = 'l', xlab = "Position relative to TSS",
ylab = "Mean no. of alignments")
lines(as.numeric(colnames(prCombO)), filter(prCombO[1,], rep(1/100,100)),
type = 'l', col = "red")
legend(title = "strand", legend = c("same as query","opposite of query"),
x = "topleft", col = c("black","red"),
lwd = 1.5, bty = "n", title.adj = 0.1)
```
### Using a `r Biocpkg("BSgenome")` package as reference genome
`r Biocpkg("QuasR")` also allows using of `r Biocpkg("BSgenome")` packages instead
of a `fasta` file as reference genome (see section \@ref(genome)).
To use a `r Biocpkg("BSgenome")`, the `genome` argument of `qAlign` is set to a
string matching the name of a `r Biocpkg("BSgenome")` package, for example
`"BSgenome.Hsapiens.UCSC.hg19"`. If that package is not already installed, `qAlign`
will abort with an informative message describing how to install the package
using `BiocManager::install`.
The corresponding alignment index will be saved as a new package, named after the
original `r Biocpkg("BSgenome")` package and the aligner used to build the index,
for example `BSgenome.Hsapiens.UCSC.hg19.Rbowtie`.
The code example below illustrates the use of a `r Biocpkg("BSgenome")` reference
genome for the same example data as above. Running it for the first time will take
a few hours in order to build the aligner index, but subsequent uses of the
same reference genome will reuse the existing index and immediately start alignments:
```{r ChIP_BSgenomeProject, eval=FALSE}
file.copy(system.file(package="QuasR", "extdata"), ".", recursive=TRUE)
sampleFile <- "extdata/samples_chip_single.txt"
auxFile <- "extdata/auxiliaries.txt"
available.genomes() # list available genomes
genomeName <- "BSgenome.Hsapiens.UCSC.hg19"
proj1 <- qAlign(sampleFile, genome=genomeName, auxiliaryFile=auxFile)
proj1
```
## RNA-seq: Gene expression profiling{#RNAseq}
In `r Biocpkg("QuasR")`, an analysis workflow for an RNA-seq dataset is very similar
to the one described above for a ChIP-seq experiment. The major difference is that
here reads are aligned using `qAlign(..., splicedAlignment=TRUE, aligner="Rhisat2")`,
which will cause `qAlign` to align reads with the `HISAT2` aligner [@hisat2] (via the
`r Biocpkg("Rhisat2")` package), rather than with `bowtie` [@bowtie]. Before the
`r Biocpkg("Rhisat2")` package was available (introduced in Bioconductor 3.9),
`qAlign(... splicedAlignment=TRUE)` aligned reads using `SpliceMap` [@SpliceMap], which
is not recommended now but still possible in order to reproduce old results.
Spliced paired-end alignments are also supported; the `splicedAlignment`
argument can be freely combined with the `paired` argument. In addition, HISAT2
also allows the specification of known splice sites, which can help in the read
alignment. This is done by specifying the argument `geneAnnotation` in `qAlign()`,
to either a `.gtf` file or a `sqlite` database generated by exporting a `TxDb` object.
### Spliced alignment of RNA-seq reads
We start the example workflow by copying the example data files into the current
working directly, into a subfolder called `"extdata"`, and then create spliced
alignments using `qAlign`:
```{r RNA_qAlign, eval=TRUE}
file.copy(system.file(package = "QuasR", "extdata"), ".", recursive = TRUE)
sampleFile <- "extdata/samples_rna_paired.txt"
genomeFile <- "extdata/hg19sub.fa"
proj2 <- qAlign(sampleFile, genome = genomeFile,
splicedAlignment = TRUE, aligner = "Rhisat2")
proj2
```
Aligning the reads with `splicedAlignment=TRUE` will allow to also align reads
that cross exon junctions, and thus have a large deletion (the intron)
relative to the reference genome.
```{r RNA_alignmentStats, eval=TRUE}
proj2unspl <- qAlign(sampleFile, genome = genomeFile,
splicedAlignment = FALSE)
alignmentStats(proj2)
alignmentStats(proj2unspl)
```
### Quantification of gene and exon expression
As with ChIP-seq experiments, `qCount` is used to quantify alignments. For
quantification of gene or exon expression levels, `qCount` can be called with a
query of type `TxDb`, such as the one we constructed in the ChIP-seq workflow above
from a `.gtf` file. The argument `reportLevel` can be used to control if annotated
exonic regions should be quantified independently (`reportLevel="exon"`) or
non-redundantly combined per gene (`reportLevel="gene"`):
```{r RNA_qCount, eval=TRUE}
geneLevels <- qCount(proj2, txdb, reportLevel = "gene")
exonLevels <- qCount(proj2, txdb, reportLevel = "exon")
head(geneLevels)
head(exonLevels)
```
### Calculation of RPKM expression values
The values returned by `qCount` are the number of alignments. Sometimes it is
required to normalize for the length of query regions, or the size of the libraries.
For example, gene expression levels in the form of *RPKM* values (reads per kilobase
of transcript and million mapped reads) can be obtained as follows:
```{r RNA_RPMK, eval=TRUE}
geneRPKM <- t(t(geneLevels[,-1] / geneLevels[,1] * 1000)
/ colSums(geneLevels[,-1]) * 1e6)
geneRPKM
```
Please note the RPKM values in our example are higher than what you would usually
get for a real RNA-seq dataset. The values here are artificially scaled up because
our example data contains reads only for a small number of genes.
### Analysis of alternative splicing: Quantification of exon-exon junctions
Exon-exon junctions can be quantified by setting `reportLevel="junction"`. In this
case, `qCount` will ignore the `query` argument and scan all alignments for any
detected splices, which are returned as a `GRanges` object: The region start and
end coordinates correspond to the first and last bases of the intron, and the
counts are returned in the `mcols()` of the `GRanges` object. Alignments that
are identically spliced but reside on opposite strands will be quantified separately.
In an unstranded RNA-seq experiment, this may give rise to two separate counts
for the same intron, one each for the supporting alignments on plus and minus
strands.
```{r RNA_junction, eval=TRUE}
exonJunctions <- qCount(proj2, NULL, reportLevel = "junction")
exonJunctions
```
About half of the exon-exon junctions detected in this sample dataset correspond
to known introns; they tend to be the ones with higher coverage:
```{r RNA_junction2, eval=TRUE}
knownIntrons <- unlist(intronsByTranscript(txdb))
isKnown <- overlapsAny(exonJunctions, knownIntrons, type = "equal")
table(isKnown)
tapply(rowSums(as.matrix(mcols(exonJunctions))),
isKnown, summary)
```
When quantifying exon junctions, only spliced alignments will be included in the
quantification. It is also possible to only include unspliced alignments in the
quantification, for example when counting exon body alignments that complement the
exon junction alignments. This can be done using the `includeSpliced` argument
from `qCount`:
```{r RNA_includeSpliced, eval=TRUE}
exonBodyLevels <- qCount(proj2, txdb, reportLevel = "exon",
includeSpliced = FALSE)
summary(exonLevels - exonBodyLevels)
```
```{r RNA_qcplot1, eval=TRUE, echo=FALSE}
qcdat2 <- qQCReport(proj2unspl, pdfFilename = "qc_report.pdf")
```
## smRNA-seq: small RNA and miRNA expression profiling