generated from Bioconductor/BuildABiocWorkshop
/
CoevolutionNetworks.Rmd
404 lines (331 loc) · 15.4 KB
/
CoevolutionNetworks.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
---
title: "Building Coevolution Networks with SynExtend"
author: Aidan Lakshman^[ahl27@pitt.edu]
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{CoevolutionNetworks}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
markdown:
wrap: 72
---
```{r setup, include=FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(echo = TRUE)
library(DECIPHER)
library(SynExtend)
library(igraph)
```
```{r echo=FALSE, out.width='100%'}
knitr::include_graphics('images/PipelineProtWeaver.png')
```
## Coevolutionary Analysis
At this point, we've walked through the steps to take a set of sequences and obtain
a set of COGs with phylogenetic reconstructions for each COG. We're now ready to look for signals of coevolution, which imply functional associations between COGs.
These methods are implemented via the `EvoWeaver` class in `SynExtend`, which
includes many commonly used methods for detecting coevolutionary patterns.
While the previous steps have only utilized a small subsample of the data, we're
now finally going to work with the complete dataset. This dataset is comprised
of the 91 *Micrococcus* genomes available with assemblies at the Scaffold,
Chromosome, or Complete level ([link to query](https://www.ncbi.nlm.nih.gov/data-hub/genome/?taxon=1269&annotated_only=true&refseq_annotation=true&assembly_level=1%3A3)). Note that more genomes may become
available after this conference; these are all that were available at the time.
We ran the complete pipeline of identifying and annotating genes with `DECIPHER`, finding COGs with `SynExtend`, and then creating gene trees for each COG using `DECIPHER`. The complete data consist of 3,407 distinct COGs. All of this analysis
is performed entirely within `SynExtend` and `DECIPHER`; no external packages
or data are required aside from the input genomes.
We now use the new `EvoWeaver` class to try to find COGs that show evidence of
correlated evolutionary selective pressures, also referred to as 'coevolutionary
signal'.
```{css echo=FALSE}
.hvr-grow-shadow {
display: inline-block;
vertical-align: middle;
-webkit-transform: perspective(1px) translateZ(0);
transform: perspective(1px) translateZ(0);
box-shadow: 0 0 1px rgba(0, 0, 0, 0);
-webkit-transition-duration: 0.3s;
transition-duration: 0.3s;
-webkit-transition-property: box-shadow, transform;
transition-property: box-shadow, transform;
}
.hvr-grow-shadow:hover, .hvr-grow-shadow:focus, .hvr-grow-shadow:active {
box-shadow: 0 10px 10px -10px rgba(0, 0, 0, 0.5);
-webkit-transform: scale(1.1);
transform: scale(1.1);
}
```
```{r echo=FALSE}
dir <- system.file('extdata', 'CoevNetworks',
package='CompGenomicsBioc2022')
downloadthis::download_dir(
path = dir,
output_name = "CoevNetworkFiles",
button_label = "Download Complete Micrococcus Dataset",
button_type = "success",
has_icon = TRUE,
icon = "fa fa-download",
self_contained = FALSE,
class = "hvr-grow-shadow"
)
```
```{r eval=FALSE}
library(DECIPHER)
library(SynExtend)
COGsFile <- '/path/to/MicrococcusCOGs.RData'
AnnotationsFile <- '/path/to/MicrococcusAnnotations.RData'
TreesFile <- '/path/to/MicrococcusGeneTrees.RData'
# All of these files are formatted the same way, as lists with each entry
# corresponding to a COG. COGs entries are identifiers for each gene, CogsAnnot
# entries are annotations, and CogTrees entries are gene trees
load(COGsFile) # Loads 'COGs'
load(AnnotationsFile) # Loads 'CogsAnnot'
load(TreesFile) # Loads 'CogTrees'
```
```{r echo=FALSE}
COGsFile <- system.file('extdata', 'CoevNetworks',
'MicrococcusCOGs.RData',
package='CompGenomicsBioc2022')
AnnotationsFile <- system.file('extdata', 'CoevNetworks',
'MicrococcusAnnotations.RData',
package='CompGenomicsBioc2022')
TreesFile <- system.file('extdata', 'CoevNetworks',
'MicrococcusGeneTrees.RData',
package='CompGenomicsBioc2022')
load(COGsFile)
load(AnnotationsFile)
load(TreesFile)
```
```{r}
# Let's look at one of the COGs
COGs[[7]]
CogsAnnot[[7]]
CogsAnnot[[7]][[1]]
CogTrees[[7]]
# Removing leaf labels because they make it hard to read
plot(CogTrees[[7]], leaflab='none', main='COG 7')
```
There is a ton of data here, and we unfortunately don't have time to look at
all of it. To demonstrate some of the things we can do with `EvoWeaver`, we're
going to look at subset of the data that is easier to investigate.
We'll subset the COGs to ones that meet the following characteristics:
- COGs with no paralogs
- COGs not part of the core genome
- COGs present in at least 5 genomes (not singletons or super rare ones)
- COGs with at least one high confidence annotation
- COGs that imply a coding region
These are relatively arbitrary requirements, but they subset the data to
an example that runs quickly and is easily understandable. This essentially takes
a group of COGs that are not essential to the organisms but tend to appear a lot,
that have been characterized individually before.
```{r results='hide', fig.keep='all', warning=FALSE}
## Subsetting COGs
# Cutoff values (91 genomes total!)
CoreCutoff <- 88
UnderCutoff <- 5
# Get assembly identifiers for each COG
truncCOGs <- lapply(COGs, \(x) sort(as.integer(gsub('^([0-9]+)_.*', '\\1', x))))
# Find COGs without paralogs
noParas <- sapply(truncCOGs, \(x) length(x) == length(unique(x)))
# Get non-core genome
notCoreCOGs <- sapply(truncCOGs, \(x) length(unique(x)) < CoreCutoff)
# Get genes in more than 5 organisms
notSingles <- sapply(truncCOGs, \(x) length(unique(x)) > UnderCutoff)
# Make sure COGs are coding elements
codingCOGs <- sapply(CogsAnnot, \(x) is(x, 'Taxa'))
# At least one high confidence annotation
highConf <- sapply(CogsAnnot, \(x)
if(is(x, 'Taxa'))
max(sapply(x, \(y)
y$confidence[length(y$confidence)])) > 50
else FALSE
)
# Subset for the workshop
WorkshopSubset <- noParas & notCoreCOGs & notSingles & codingCOGs & highConf
# Subset our data
WCogs <- COGs[WorkshopSubset]
WAnnots <- CogsAnnot[WorkshopSubset]
WTrees <- CogTrees[WorkshopSubset]
```
Let's also put together a list of consensus annotations for each COG.
```{r}
# Initialize a character vector to populate
consAnnots <- vector('character', length=length(CogsAnnot))
# Loop along the annotations for each COG
for ( i in seq_along(CogsAnnot) ) {
taxaentry <- CogsAnnot[[i]]
# If no annotation, it's a noncoding gene
if (!is(taxaentry, 'Taxa'))
consAnnots[[i]] <- 'NONCODING'
# Otherwise it's a coding gene
else {
# Grab all the annotations aside from "Unclassified"
annots <- sapply(taxaentry, \(y) y$taxon[length(y$taxon)])
annots <- annots[annots!='unclassified_Root']
# If we only have "Unclassified", just mark it as uncharacterized
if (length(annots) == 0)
consAnnots[[i]] <- 'Uncharacterized'
# Otherwise take the most common annotation
else
consAnnots[[i]] <- names(sort(table(annots), decreasing=T))[1]
}
}
```
Now we can make our `EvoWeaver` object. `EvoWeaver` has multiple input options,
either a list formatted like `COGs` (list with gene identifiers) or a list like
`CogTrees` (list with gene trees).
```{r eval=FALSE}
# EvoWeaver constructor
pw <- EvoWeaver(WTrees)
```
The `EvoWeaver` constuctor automatically detects the type of data you have and
adjusts available predictors accordingly. While it functions best with a list
of dendrograms for each COG, it can also run with simple presence/absence patterns.
See the documentation file for `EvoWeaver` for more information on this functionality.
We're now ready to make predictions. Predicting functional associations is done
with the `predict.EvoWeaver` S3 method. Let's examine possible functional associations between the COGs we have.
```{r echo=FALSE}
OutFile <- system.file('extdata', 'CoevNetworks',
'PrecomputedOutput.RData',
package='CompGenomicsBioc2022')
load(OutFile)
```
```{r eval=FALSE}
preds <- predict(pw)
```
```{r}
print(preds)
```
## Viewing our results
Notice that `preds` is a `EvoWeb` object. This is just a simple S3 class with a
pretty print method wrapping a matrix of pairwise association scores. `EvoWeb`
inherits from the `simMat` class, which incorporates functionality similar
to `dist` objects (notably memory-efficient storage of symmetric matrices) but with matrix-like operations and pretty print functions.
See `?simMat` for more info.
`EvoWeb` objects are easily coercible to matrices using `as.matrix`, but be aware matrix representations require roughly double the storage. `as.data.frame` will coerce `EvoWeb` objects into a `3xN` matrix of pairwise scores.
We can use these functions to examine a histogram of association scores for all our predictions:
```{r}
pwData <- as.matrix(preds)
hist(pwData[upper.tri(pwData)], main='Histogram of Association Scores',
xlab='Strength of Association')
```
The `EvoWeb` class will be updated next release cycle to include more methods,
including a custom plotting function. The current `plot.EvoWeb` S3 method
implements a force-directed embedding of the pairwise scores, but it's a
big work-in-progress. Stay tuned for the next release cycle for more functionality
regarding `EvoWeb`.
In the meantime, we can use the `igraph` package to find clusters of coevolving
COGs.
```{r eval=FALSE}
library(igraph)
```
```{r}
set.seed(123) # Reproducibility
adjMatrix <- as.matrix(preds)
g <- graph_from_adjacency_matrix(adjMatrix, weighted=TRUE,
mode='undirected', diag=FALSE)
clusters <- cluster_louvain(g)
# Getting the clusters & identifying COGs by consensus annotation
clusterLabels <- vector('list', length(clusters))
for ( i in seq_along(clusterLabels) ){
cluster <- communities(clusters)[[i]]
labs <- consAnnots[as.integer(cluster)]
clusterLabels[[i]] <- labs[order(sapply(labs, \(x) strsplit(x, ' ')[[1]][3]))]
}
# Let's examine a cluster
clusterLabels[[3]]
```
Note here that we've identified all the COGs in this group to be associated based
on coevolutionary pressures, and this conclusion is supported by annotation
consistency across the members of this group. Of note here is that this clustering
was done with no prior knowledge on what these COGs do; we reached this conclusion
using purely sequencing data as input. This cluster is also very similar to
results in [the STRING database](https://version-11-5.string-db.org/cgi/network?networkId=bHOgs2hNXWxE).
While STRING doesn't have the `nixA`, it does have `Achl_0397`, a similar
nickle/cobalt transporter. We also see some other proteins that could have activity associated with urease function.
We chose cluster 3 because it's small and easy to see common function. Louvain clustering identifies 4 total clusters, with cluster 1 seemingly related to stress response activity, cluster 2 related to transporting and using metals, cluster 3 related to urease activity, and cluster 4 relating to arabinogalactan/maltooligosaccharide transport.
It's important to note here that this analysis is agnostic to the presence/absence of annotations. We subset our proteins to ones with high confidence annotations in this example primarily for purposes of demonstration. In actual applications we would use `EvoWeaver` with annotated and non-annotated proteins to infer novel function of hypothetical proteins.
## Methods Implemented in EvoWeaver
By default, `predict.EvoWeaver` makes an ensemble prediction using as many individual
models as it can run with the data provided. However, users are free to use any of
the individual models without the ensemble predictor. The methods implemented are
the following:
```{r eval=FALSE}
# PHYLOGENETIC PROFILING METHODS:
## P/A = Presence/Absence Profiles
## Jaccard distance of P/A
Jaccard <- predict(pw, Method='Jaccard')
## Hamming distance of P/A
Hamming <- predict(pw, Method='Hamming')
## MutualInformation of P/A
MutualInf <- predict(pw, Method='MutualInformation')
## Direct Coupling Analysis of P/A
ProfDCA <- predict(pw, Method='ProfileDCA')
## Correlation of gain/loss events on phylogeny, requires Species Tree
## Commented out because we don't have a species tree
#Behdenna <- predict(pw, Method='Behdenna', mySpeciesTree=exSpeciesTree)
# CO-LOCALIZATION METHODS:
Colocalization <- predict(pw, Method='Coloc') # Co-localization analysis
# DISTANCE MATRIX METHDOS:
MirrorTree <- predict(pw, Method='MirrorTree')
ContextTree <- predict(pw, Method='ContextTree')
# Residue Methods: (ONLY AVAILABLE IN DEV VERSION)
# ## MutualInf of residues
ResidueMI <- predict(pw, Method='ResidueMI')
```
## Runtime Considerations
Runtime of `EvoWeaver` depends on the number of COGs and the method used. Most of
the methods are extremely quick, on the order of 0-1 seconds per comparison.
The default `Method="Ensemble"` implements the fastest methods, giving a good
compromise between speed and accuracy. For a set of 100 COGs, `Ensemble` prediction
for all pairs took 30 seconds on a 2021 M1 MacBook Pro.
Note that number of pairs computed scales quadratically with the number of COGs.
100 COGs results in ~5k comparisons, but 200 COGs results in ~20k comparisons.
`ResidueMI` is currently the slowest method implemented, and is not used in
`Ensemble` prediction. Prediction on 10 COGs took 30 seconds. Future releases
will optimize residue level methods for use in `Ensemble` prediction, but use of
these methods is inherently difficult due to runtime scaling on both the number of
COGs, the size of each tree, and the length of each gene sequence.
These predictions can be parallelized across multiple compute nodes. This implementation
is recommended for using `ResidueMI` on large datasets. A single `ResidueMI` comparison
completes in roughly 1 second, and `Ensemble` prediction in around 0.5 seconds.
```{css, echo=FALSE}
.pagination {
display: flex;
display: -webkit-flex;
padding-left: 0;
list-style: none;
justify-content: center
}
```
<div class="center">
<ul class="pagination pagination-lg">
<li class="page-item">
<a class="page-link" href="https://www.ahl27.com/CompGenomicsBioc2022/articles/ConstructingPhylogenies.html">«</a>
</li>
<li class="page-item">
<a class="page-link" href="https://www.ahl27.com/CompGenomicsBioc2022/articles/Setup.html">1</a>
</li>
<li class="page-item">
<a class="page-link" href="https://www.ahl27.com/CompGenomicsBioc2022/articles/LoadingGenomeData.html">2</a>
</li>
<li class="page-item">
<a class="page-link" href="https://www.ahl27.com/CompGenomicsBioc2022/articles/GeneCallingAnnotation.html">3</a>
</li>
<li class="page-item">
<a class="page-link" href="https://www.ahl27.com/CompGenomicsBioc2022/articles/FindingCOGs.html">4</a>
</li>
<li class="page-item">
<a class="page-link" href="https://www.ahl27.com/CompGenomicsBioc2022/articles/ConstructingPhylogenies.html">5</a>
</li>
<li class="page-item active">
<a class="page-link" href="https://www.ahl27.com/CompGenomicsBioc2022/articles/CoevolutionNetworks.html">6</a>
</li>
<li class="page-item">
<a class="page-link" href="https://www.ahl27.com/CompGenomicsBioc2022/articles/Conclusion.html">7</a>
</li>
<li class="page-item">
<a class="page-link" href="https://www.ahl27.com/CompGenomicsBioc2022/articles/Conclusion.html">»</a>
</li>
</ul>
</div>