Table of contents:
SynExtend is a package of tools for working with synteny objects generated from the Bioconductor Package DECIPHER.
SynExtend is present in the package repository Bioconductor and the current release
version can be installed via:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()
BiocManager::install("SynExtend")
library("SynExtend")
The devel
version in Bioconductor can be installed via:
BiocManager::install("SynExtend", version = "devel")
library("SynExtend")
The version in this repository (which should always be up to date with devel
and frequently ahead of release
) can be installed with devtools via:
devtools::install_github(repo = "npcooley/synextend")
library("SynExtend")
Additionally, SynExtend is maintained in a Docker container on Dockerhub that is here.
SynExtend facilitates the prediction of orthologous gene pairs from synteny maps. The functions herein rely on synteny maps built by the FindSynteny()
function in the package DECIPHER
. The process is relatively straightforward and only requires assemblies and genecalls, which can be conveniently pulled from the NCBI using Entrez Direct. However, as long as given assemblies have associated gene calls, or can have gene calls generated, they will work just fine. DECIPHER
now contains the function FindGenes()
that produces high quality de novo gene calls.
Using Entrez Direct, this process would start like this:
library(SynExtend)
EntrezQuery <- paste("esearch -db assembly ",
"-query '",
"kitasatospora[organism] ",
'AND "complete genome"[filter] ', # only complete genomes
'AND "refseq has annotation"[properties] ', # only genomes with annotations
'AND "latest refseq"[filter] ', # only latest
"NOT anomalous[filter]' ",
'| ',
'esummary ',
'| ',
'xtract -pattern DocumentSummary -element FtpPath_RefSeq',
sep = "")
FtPPaths <- system(command = EntrezQuery,
intern = TRUE,
timeout = 300L) # timeout argument is required for RStudio only
FNAs <- unname(sapply(FtPPaths,
function(x) paste(x,
"/",
strsplit(x,
split = "/",
fixed = TRUE)[[1]][10],
"_genomic.fna.gz",
sep = "")))
GFFs <- unname(sapply(FtPPaths,
function(x) paste(x,
"/",
strsplit(x,
split = "/",
fixed = TRUE)[[1]][10],
"_genomic.gff.gz",
sep = "")))
With these matched character vectors describing files on the NCBI FTP site, we can pull given assemblies, and their associated GFF
files.
Assemblies are pulled using functions within the package Biostrings
and placed in a sqlite database using the package DECIPHER
. Note, DBPATH
can be a specific local file location, though a tempfile works fine for this example. The tempfile in this example will be destroyed upon closure of the active R session, so specific file paths are preferable for most user activity. The code chunk below provides an example of pulling down assemblies from the NCBI and using DECIPHER
's built in gene finder to generate gene calls.
DBPATH <- tempfile()
GC01 <- vector(mode = "list",
length = length(FNAs))
for (m1 in seq_along(FNAs)) {
X <- readDNAStringSet(FNAs[m1])
Seqs2DB(seqs = X,
type = "XStringSet",
dbFile = DBPATH,
identifier = as.character(m1),
verbose = TRUE)
GC01[[m1]] <- FindGenes(myDNAStringSet = X)
}
names(GC01) <- seq(length(FNAs))
A current quirk to this process is that genecalls must be present in a named list, where the names can be matched to integer identifiers in the DECIPHER
database used to construct a synteny object. Users have the option of using the function FindGenes
in DECIPHER
, using the rtracklayer
function import
to import a valid GFF
file, or importing a GFF
with the SynExtend
function gffToDataFrame
which performs some data munging to convert the GFF
into a DataFrame
for user convenience.
GC02 <- vector(mode = "list",
length = length(GFFs))
for (m1 in seq_along(GFFs)) {
GC02[[m1]] <- gffToDataFrame(GFF = GFFs[m1],
Verbose = TRUE)
}
names(GC02) <- seq(length(GC02))
With assemblies and genecalls now in a user's workspace, users can build a synteny map with FindSynteny
. See ?FindSynteny
in R for further reading. Finding and tabulating where genes are connected by syntenic k-mers is then performed with NucleotideOverlap
, and that raw data is converted into a convenient data.frame
with the function PairSummariers
.
Syn <- FindSynteny(dbFile = DBPATH)
Links <- NucleotideOverlap(SyntenyObject = Syn,
GeneCalls = GC01,
Verbose = TRUE)
Pairs01 <- PairSummaries(SyntenyLinks = Links,
PIDs = TRUE,
Verbose = TRUE)
Clusters01 <- DisjointSet(Pairs = Pairs01,
Verbose = TRUE)
Pairs where predictions create conflicts can be removed by a simple subsetting function.
Pairs02 <- SubSetPairs(CurrentPairs = Pairs01,
Verbose = TRUE)
Clusters02 <- DisjointSet(Pairs = Pairs02,
Verbose = TRUE)
A more self-contained example is maintained in this package's Bioconductor vignette, and can be found here
SynExtend also allows for prediction of functional association networks from evidence of correlated selective pressures between of pairs of clusters of orthologous genes (COGs). Given a list of gene trees or presence/absence data, the EvoWeaver
class in SynExtend uses multiple algorithms to compute likelihood of pairwise functional association. This list can be generated with DisjointSet(...)
, or can be created by the user.
An example of using EvoWeaver
can be done using the built-in example Streptomyces dataset. This will predict functional association of 200 genes from 301 genomes.
exData <- get(data("ExampleStreptomycesData"))
ew <- EvoWeaver(exData$Genes)
# For faster runtime, set subset as follows:
# predictions <- predict(ew, mySpeciesTree=exData$Tree, subset=1:20)
predictions <- predict(ew, mySpeciesTree=exData$Tree)
# View number of genes and number of predictions:
# Print out the adjacency matrix:
predictions
# Print out pairwise associations:
as.data.frame(predictions)
# Plot genes using force-directed embedding:
plot(predictions)
Current algorithms used for prediction are Jaccard distance, Hamming distance, Mutual Information, Direct Coupling Analysis, Phylogenetic Gain/Loss events, Co-localization, MirrorTree, and ContextTree. See ?EvoWeaver
for more details.
SynExtend currently has an extensive TODO list that includes:
- Parsing communities of predicted pairs.
- User friendly accessor functions to help with printing and plotting data.
If users have specific requests I would love to incorporate them.
Feel free to email me at npc19@pitt.edu or submit issues here on GitHub. Questions related to EvoWeaver
can be sent to ahl27@pitt.edu or submitted similarly on GitHub.