NOTE: This section is based on the code provided in the DEXSeq vignette, which can be checked for extra information.
So far we have been focusing on analysing the transcriptome from a gene-centric perspective. However, one of the advantages of RNA sequencing is that it allows us to address questions about alternative splicing in a much easier way than it used to be possible with microarrays. There are a large number of methods to detect significant differences in splicing across conditions (e.g. cuffdiff, mmdiff), many of which rely on the non-trivial task of estimating transcript expression levels. Here we will focus on DEXSeq, a Bioconductor package for the identification of differential exon usage events from exon counts. In other words, DEXSeq reports cases where the expression of a given exon, relative to the expression of the corresponding gene, changes across conditions.
The structure of a DEXSeq analysis is analogous to the one we have seen so far for differential expression with DESeq: count normalisation, dispersion estimation and differential exon usage test. However, there are a couple of things to take into account before getting started with that workflow, as we'll see next.
Exons might be represented multiple times in a standard GTF file if they are shared between multiple transcripts or genes. This overlap can include the entire exon, or just part of it, as illustrated in Figure 1 from the DEXSeq paper. Thus, in order to ensure that each exon is tested only once, DEXSeq requires a slightly modified annotation file, in which exons are flattened into counting bins. We only need to prepare this flattened annotation file once, and this can be achieved by executing the command below:
# do not run - it takes a while
# you'll find the output here: RNAseq_all/reference
cd RNAseq_all/reference
python /path/to/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py \
--aggregate=no Drosophila_melanogaster.BDGP5.25.62.chr.gtf \
Drosophila_melanogaster.BDGP5.25.62.chr_dexseq_noaggregate.gff
Exercise: How does the newly generated annotation file differ from the previous one? Try this:
cd RNAseq_all/reference
original=Drosophila_melanogaster.BDGP5.25.62.chr.gtf
grep FBgn0031208 $original | awk '$3=="exon"'
flattened=Drosophila_melanogaster.BDGP5.25.62.chr_dexseq_noaggregate.gff
grep FBgn0031208 $flattened | awk '$3=="exonic_part"'
What is the number of lines obtained in each case? What is the number of counting bins for this gene? Hint - Solution
Exercise: One of the options used in this example to generate the flattened annotation file is --aggregate=no
. What does this refer to? Hint:
python /path/to/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py -h
Each of the exonic parts in the flattened annotation file are the potentially testable bins. However, before we can perform the testing, we first need to know the number of reads that overlap with each of them.
Provided that we have aligned our reads to the genome with a splice-aware aligner (e.g. TopHat) we can now proceed to count the number of reads that fall into exon bins in our sample:
# do not run - it takes a while
# you'll find the output here: RNAseq_all/data/mapped
gff_file=Drosophila_melanogaster.BDGP5.25.62.chr_dexseq_noaggregate.gff
bam_file=../data/mapped/untreated3.nsorted.bam
out=$bam_file.dexseq_noaggregate.txt
samtools view $bam_file | python /path/to/library/DEXSeq/inst/python_scripts/dexseq_count.py \
--paired=yes --stranded=no $gff_file - $out
Exercise: By default, dexseq_count
will only consider reads that mapped with a mapping quality of 10 or more. Even though we didn't explicitly set this option in the command above, we can learn about this on the help text:
python /path/to/library/DEXSeq/inst/python_scripts/dexseq_count.py -h
What does this threshold refer to? Solution
Exercise: Previously we have been calculating the number of reads that overlap each gene using htseq-count
.
What's the difference between these two tools? Hint: think of the previous steps in this section and the input required by this tool. Also, would the gene counts generated with these two tools be the same? Solution
Finally we just need to load the counts into R. Here we'll be working with an example dataset, and thus we'll be loading the counts directly from the pasilla library:
library(DEXSeq)
library("pasilla")
data("pasillaExons")
ecs=pasillaExons
head(counts(pasillaExons))
Alternatively, to load the count files for your experiment into R, you should first generate a table summarising your experimental design:
cat sampleTable.txt
# countFile condition
# untreated1 untreated1.counts control
# untreated2 untreated2.counts control
# untreated3 untreated3.counts control
# untreated4 untreated4.counts control
# treated1 treated1.counts knockdown
# treated2 treated2.counts knockdown
# treated3 treated3.counts knockdown
And then load it in R and generate an expression set object:
library(DEXSeq)
sampleTable=read.table("sampleTable.txt")
annot="Drosophila_melanogaster.BDGP5.25.62.chr_dexseq_noaggregate.gff"
ecs = read.HTSeqCounts(
countfiles = sampleTable$countFile,
design = sampleTable,
flattenedfile = annot )
Independently on whether you're working with the example counts (available through the pasilla library) or the ones for your own samples, the next essential step of the workflow consists on estimating the size factors for each library, used to take into account variable sequencing depths. This step is common to the one previously followed with DESeq (see the section on Normalising counts):
ecs=estimateSizeFactors(ecs)
sizeFactors(ecs)
Also analogous to the workflow followed with DESeq is the step on estimating the dispersion on the observed counts for each of the exons (see the section on Differential gene expression). This information is used to quantify the variability that we can expect between biological replicates, and will help us in addressing which of the observed differences are big enough to be attributed to a change in the condition.
ecs=estimateDispersions( ecs )
ecs=fitDispersionFunction( ecs )
We can next plot the calculated dispersion estimates as a function of the mean normalised counts, just as a sanity test:
out="dexseq_dispersion.pdf"
pdf(file=out)
plotDispEsts( ecs )
dev.off()
Exercise: Exons with a low number of counts tend to have very high variability and will not end up as a significant result. In order to reduce computation time, DEXSeq skips such exons, as specified in the help for the estimateDispersions
function:
?estimateDispersions
What's the fraction of exons in our dataset that will be tested by DEXSeq? Hint:
counts_subset=head(counts(ecs))
testable_subset=head(fData(ecs)$testable)
cbind(counts_subset, testable_subset)
Finally, we can test for differential exon usage and calculate the fold-changes:
ecs=testForDEU(ecs)
ecs=estimatelog2FoldChanges( ecs )
result=DEUresultTable(ecs)
head( result )
Exercise: Why do we get NA
values for FBgn0000256:E006
? Solution
Exercise: How many exons do we identify as differentially used (e.g. FDR < 0.1)? How many genes? Solution
It is in general a good practise to visualise the results in the form of an MA-plot:
out="dexseq_ma.pdf"
pdf(file=out)
plotMA( ecs, FDR=0.1, ylim=c(-4,4), cex=0.8 )
dev.off()
In addition, DEXSeq offers a nice way to visualise differential exon usage events for a given gene:
out="dexseq_FBgn0085442.pdf"
pdf(file=out)
plotDEXSeq( ecs, "FBgn0085442", expression=FALSE,
norCounts=TRUE, displayTranscripts=TRUE,
legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )
dev.off()