-
Notifications
You must be signed in to change notification settings - Fork 0
/
analysis.Rmd
126 lines (87 loc) · 5.79 KB
/
analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
---
title: "R Notebook"
---
```{r}
# install require packages
# BiocManager::install("YuLab-SMU/treedataverse")
# to install (tidytree, treeio, ggtree, ggtreeExtra)
# load required libraries
library(treedataverse)
library(readr)
```
# Introduction
The project aims to simulate and rank the most common issues that faces phylogenomic multiloci analyses. The main effects that this project investigates are:
- Horizontal gene transfer (HGT)
- Paralogy
- Long branch attraction (LBA)
Each of these factors is incorporated in each of the simulated alignments using a reference base tree. The reference clean ture tree has four groups of taxa:
- Group A: sister to B
- Group B: sister to A
- Group C: First split of the ingroup, sister to (A,B)
- Group Z: Outgroup to define the root of the ingroup
## Creating the tree for simulating the errors
### HGT
For HGT simulations, Branches at different hisotry levels from group B is simulated as HGT events targetting group C. All alignment will have these copies to be tested incrementally and in different combinations. The expected breaking point is when group C become aretifically closer to group B leaving out group A the first split.
### Paralogy
Paralogy problem occurs in genes in which part of the ingroup taxa are related through a duplication event that preceded their speciation. Two levels of paralogy are simulated by branches that emrge before the speciation node of the main ingroups A, B, and C. These two paralog pranches are after the B group to simulate the process of mixing paralogs with orthologs.
### Long Branch Attraction
Long branch attraction is when two taxa with long branches are aretifcially inferred as closer to each other than their true relation. This phenomena is more problematic if the outgroup and one of the ingroup taxa has long branches that make them aretificially drown toward eachother.
Another scnario would be to measure how long branch groups are affected by the other two problems, namely HGT and paralogy. To simulate this, a another branch of Groups C is simulated to have evolved with long branches and tested if their taxa are contaminated with HGT. As for testing how paralogy contamination affects taxa with long pranches the simulation is done by simulated Group B as a long branch, and simulated with gradually replacing true sequences with paralog ones from the same two branches that split before the LCA of ortholog group of A, B and C.
```{r}
# function to scale branch lengths of a raw newick tree
# read from a tree file and write to a new tree file
newick_file_scale <- function(infile = '', outfile = '', scl = 1) {
require(readr)
tretxt <- unlist(read_file('simTree_fix.tre') |> strsplit(":"))
nvalues <- as.numeric(sub("^(\\d\\.\\d+).*", "\\1", tretxt[-1])) * scl
nvalues <- c("", as.character(nvalues))
nvalues <- paste0(sapply(seq_along(tretxt), \(z) sub("^\\d\\.\\d+", nvalues[z], tretxt[z])), collapse = ":")
write_file(nvalues, outfile)
# Figure for the reference tree to simulated the alingments
simTree <- read.tree('simTree.tre')
simTree[["edge.length"]] <- simTree[["edge.length"]] * 2
write.tree(simTree, file = 'alignments/simTree1.tre')
groupInfo <- split(simTree$tip.label, gsub("_[^_]+_", "_", simTree$tip.label))
simTree <- groupOTU(simTree, groupInfo)
ggtree(simTree, aes(color = group), branch.length = 'rate') +
geom_tiplab() +
xlim(NA, 5)
```
## Simulating the alignments
AliSim is a functionality implemented in IQTree version 2.2.00 that can simulate the evolution of sequences given a tree and an evolution model.
```{r}
iqtree_sh <- 'alignments/generate_alignments_iqtree.sh'
grep('^iqtree2', readLines(iqtree_sh), value = TRUE) |> cat()
```
## Testing the evolution state of the simulated alignments
Inferring a single tree from each alignment, using FastTree to check that the alignment has evolved enough that each alignment individually does not sufficient phylogenetic information to produce the true tree. We are using seqkit to keep on the orthologs and the ougroup.
```{r}
path_testtrees <- 'alignments/testtrees/'
dir.create(path = path_testtrees, showWarnings = FALSE)
grep("FastTree", readLines(iqtree_sh), value = TRUE) |> cat()
```
### Testing the proportion of trees that have the true root split
```{r}
treeset <- lapply(dir(path = 'alignments/testtrees/', pattern = '\\.tre$', full.names = TRUE), \(z) {read.tree(z)})
treeset <- .compressTipLabel(treeset)
insistax <- grep(pattern = '^genus(A|B)',treeset[[1]][["tip.label"]])
sum(sapply(treeset, \(z) {is.monophyletic(z, tips = insistax)})) / length(treeset)
```
Only 11 trees (55%) recovered the correct root split, other nodes are ignored.
### Analysis
First, n sets of pure orthologus alignment are tested to provide a benchmark for an estimated cutoff point for the size at which a reliable phylogenetic signal is achieved. Creating 5 supermatrices with randomly selected partitions starting with 2 partitions and increments of 1 till 20. Each set will be concatented and a RAxML tree with bootstraps is inferred.
The test case supermatrices will be of full size (20). For each test case, one orthologous alignment is replaced with a non-orthologous till reaching 50% (10).
#### Long branch effect
- Replacing genusB_spxx_ortholog with genusB_spxx_longbranch
# Read the genrated alignments into R
```{r}
simAlignments <- dir(path = 'alignments/', pattern = '\\.fa$', full.names = TRUE)
simAlignments <- setNames(lapply(simAlignments, function(z){
as.matrix(read.FASTA(file = z, type = 'AA'))
}), nm = gsub('(^alignments/+)|(\\.fa$)', '', simAlignments))
```
This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*.
```{r}
}
```