-
Notifications
You must be signed in to change notification settings - Fork 0
/
advanced_uses.Rmd
executable file
·174 lines (150 loc) · 5.88 KB
/
advanced_uses.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
---
title: "Advanced uses of DiffSegR"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Advanced uses of DiffSegR}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r, include = FALSE}
working_directory <- "./DIFFSEGR_TEST"
dir.create(working_directory, showWarnings = FALSE)
sample_info <- data.frame(
sample = c("pnp1_1_1", "pnp1_1_2", "wt_1", "wt_2"),
condition = rep(c("pnp1_1", "wt"), each = 2),
replicate = rep(1:2,2),
bam = sapply(
c("pnp1_1_1_ChrC_71950_78500.bam",
"pnp1_1_2_ChrC_71950_78500.bam",
"wt_1_ChrC_71950_78500.bam",
"wt_2_ChrC_71950_78500.bam"
),
function(bam) system.file("extdata", bam, package = "DiffSegR")
),
isPairedEnd = rep(FALSE, 4),
strandSpecific = rep(1, 4)
)
library(DiffSegR)
nb_threads_tot = 4
nb_threads_locus = 4
data <- newExperiment(
sampleInfo = sample_info,
loci = data.frame(
seqid = "ChrC",
chromStart = 71950,
chromEnd = 78500,
locusID = "psbB_petD"
),
referenceCondition = "wt",
otherCondition = "pnp1_1",
nbThreads = nb_threads_tot,
nbThreadsByLocus = nb_threads_locus,
coverage = working_directory
)
coverage(
data = data,
coverageType = "average"
)
```
Before using this tutorial, we recommend that you read through our [introductory tutorial](introductory_tutorial.html).
# Overview
In this tutorial we explore more advanced uses of `DiffSegR`:
1. How to explore a range of segmentation models ?
2. How to select a subset of segments whose absolute $\log_2$-FC passes a threshold while keeping statistical guarantees on the true discovery proportion in this subset ? (Post hoc inference)
# Exploring more segmentations
After visual inspection, one can disagree with the segmentation of the $\log_2$-FC
proposed by **DiffSegR**. It can happen when spurious changepoints, resulting
from an over-segmentation of the $\log_2$-FC, seems to be added. In this
case it is often useful to explore several different segmentations in
order to choose a more suitable hyperparameter $\alpha$ value.
To explore all reachable segmentations between 1 to `Kmax` segments, one can pass
`segmentNeighborhood = TRUE` and specify the value of `Kmax` to `segmentationLFC()`.
This step can be time consuming. A grid search approach is also implemented in
**DiffSegR** (see `gridSearch` and `alphas` parameters of `segmentationLFC()`).
```{r}
features <- segmentationLFC(
data = data,
outputDirectory = working_directory,
segmentNeighborhood = TRUE,
Kmax = 50
)
```
The figures showing the path of all reachable segmentations up to `Kmax` segments of
the log2-FC on reverse and forward strands are accessible in the working
directory under the name *all_models_minus.rds* and *all_models_plus.rds*.
A good practice is to select a robust segmentation. Visually it is equivalent
to pick the elbow of this curve. In this example we can select
$\alpha=5.46$ symbolized by the red line.
```{r echo=FALSE, dpi=300, out.width='100%', fig.align='center', fig.cap="Path of all reachable segmentations up to 50 segments of the log2-FC on reverse strand. Corresponding alpha values on the log scale can be read on X-axis."}
all_models <- readRDS("DIFFSEGR_TEST/psbB_petD/all_models_minus.rds")
ggplot2::ggplot(
all_models,
ggplot2::aes(
x = log2(alphas),
y = K
)
)+
ggplot2::geom_line() +
ggplot2::geom_point(size=1) +
ggplot2::geom_vline(
xintercept = log2(5.46),
color = "red"
)+
ggplot2::geom_hline(
yintercept = 11,
color = "blue"
)+
ggplot2::xlab("log2(alpha)")+
ggplot2::ylab("number of segments (K)")+
ggplot2::theme(
text = ggplot2::element_text(size = 20)
) +
ggplot2::theme_bw()
```
# Post hoc inference
If the result of the DEA does not correspond to what the user expected,
they may tend to snoop in the data and eventually select a subset
of segments of interest. For instance it is a common usage to define a *threshold*
on the absolute $\log_2$-FC per-segment. In this case DiffSegR will declare
differentially expressed the largest set of segments with an absolute
$\log_2$-FC > *threshold* and a true discovery proportion lower bound choose by
the user (``postHoc_tdpLowerBound``). This post-hoc
lower bound is obtained by controlling the joint error rate (JER) using the
Simes family of thresholds (Blanchard et al. 2020).
To use post hoc inference one have to specify :
1. its decision rule, i.e a predicate function that returns a single `TRUE`
or ``FALSE`` if the segment on which it is applied meets the conditions defined in
the predicate. The criteria used in the predicate have to be defined for each
segment in ``SummarizedExperiment::mcols(dds)``. Here, ``predicate = function(x) 2^abs(x$log2FoldChange)>2`` ;
2. the minimum true discovery proportion among the subset of segments declared
differentially expressed. Here, ``postHoc_tdpLowerBound = 0.05`` ;
3. the significance level of the test procedure. Here,
``postHoc_significanceLevel = 0.05``.
```{r, include = FALSE}
SExp <- counting(
data = data,
features = features
)
```
```{r}
dds <- dea(
SExp = SExp,
design = ~ condition,
sizeFactors = rep(1,4),
predicate = function(x) 2^abs(x$log2FoldChange)>2,
postHoc_significanceLevel = 0.05,
postHoc_tdpLowerBound = 0.95
)
#- check fold-change of differentially expressed segments ---------------------#
DERs <- dds[SummarizedExperiment::mcols(dds)$DER,]
2^abs(SummarizedExperiment::mcols(DERs)$log2FoldChange)
```
# References
Blanchard, G., Neuvial, P. and Roquain, E. Post hoc confidence bounds on false
positives using reference families. Ann. Statist 48(3), 1281-1303 (2020) [doi.org/10.1214/19-AOS1847](https://hal.science/hal-01483585v5/document).