Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

phylogenetic tree construction in DADA2? #88

Closed
FabianRoger opened this issue Jun 29, 2016 · 36 comments
Closed

phylogenetic tree construction in DADA2? #88

FabianRoger opened this issue Jun 29, 2016 · 36 comments

Comments

@FabianRoger
Copy link

hej,

I was wondering if there is (did I miss it?) - or if you plan to add - the possibility to construct a phylogeny within the DADA2 pipeline? I was thinking something like Fasttree.

Along that line I was also thinking if you had any thoughts on how the fact that the sequence resolution of DADA2 is higher than the 3% OTU threshold might interfere with the construction of a tree. I'm think wether it would be a problem if two close sequences both matched the same sequence in the reference database? Should then both be assigned to the same tip and their abundances summed? Or maybe they would simply be split into two tips with distance 0 as it might be done already?

best

Fabian

@benjjneb
Copy link
Owner

Hi Fabian,

There is no phylogeny construction in the dada2 package, and none is planned. There are tools for making phylogenetic trees already in R though, and these can be used in a fairly straightforward fashion.

For an example of this, see our recent workflow paper: http://f1000research.com/articles/5-1492/v1 (Construct the phylogenetic tree section). For more details on tree building, the phangorn package is the relevant R tree-building package.

It is worth noting that the FastTree algorithm is not, to my knowledge, available in R. For building trees on large datasets (thousands of sequences or more) it may be necessary to use an external application.

@FabianRoger
Copy link
Author

thanks a lot, your workflow paper is super helpful!

It is worth noting though that the function used for alignment msa(seqs, method="ClustalW", type="dna", order="input") is very slow and scales exponentially with number of reads with the time almost quadrupling if the number of sequences doubles. From the two other methods ClustalOmega is not suited for nucleotides (according to the documentation) and Muscle failed (for me) with a cryptic error message *** ERROR *** MSA::SetIdCount: cannot increase count

However AlignSeqsfrom the DECIPHER package scales linear with number of sequences and aligned my dataset (~1000 short reads, ~420 bp) in ~20 seconds.

of cause I know that time is not the only criterion, but for an amplicon workflow it is maybe worth looking into?

thanks for all your help!

Fabian

@spholmes
Copy link
Collaborator

spholmes commented Jul 1, 2016

Thanks a lot for the pointer, when we do another revision for the workflow
paper we will
try DECIPHER as well,
Susan

On Fri, Jul 1, 2016 at 4:31 AM, FabianRoger notifications@github.com
wrote:

thanks a lot, your workflow paper is super helpful!

It is worth noting though that the function used for alignment msa(seqs,
method="ClustalW", type="dna", order="input") is very slow and scales
exponentially with number of reads with the time almost quadrupling if the
number of sequences doubles. From the two other methods ClustalOmega is
not suited for nucleotides (according to the documentation) and Muscle
failed (for me) with a cryptic error message *** ERROR ***
MSA::SetIdCount: cannot increase count

However AlignSeqsfrom the DECIPHER package scales linear with number of
sequences and aligned my dataset (~1000 short reads, ~420 bp) in ~20
seconds.

of cause I know that time is not the only criterion, but for an amplicon
workflow it is maybe worth looking into?

thanks for all your help!

Fabian


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#88 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/ABJcvTV6MGF5MyOjxsHhiScPOjOyRNAXks5qRPqGgaJpZM4JA42u
.

Susan Holmes
Professor, Statistics and BioX
John Henry Samter Fellow in Undergraduate Education
Sequoia Hall,
390 Serra Mall
Stanford, CA 94305
http://www-stat.stanford.edu/~susan/

@benjjneb
Copy link
Owner

benjjneb commented Nov 3, 2016

The revised version of the F1000 workflow (https://f1000research.com/articles/5-1492/v2) now uses DECIPHER for the multiple alignment (thanks for suggestion @FabianRoger), and phangorn for phylogenetic construction.

We remain interested in other options within the R universe for phylogenetic construction, but still do not plan to implement it within the dada2 package.

@benjjneb benjjneb closed this as completed Nov 3, 2016
@giriarteS
Copy link

I used DECIPHER to align ITS sequences (5542 sequences) and then tried to construct the tree with phangorn but it was too slow. I finally constructed a tree with raxml:
alignment <- read.dna("alignment.fasta",format="fasta",as.matrix=TRUE)
alignment.rax.gtr <- raxml(alignment,
m="GTRGAMMAIX", # model
f="a", # best tree and bootstrap
p=1234, # random number seed
x=2345, # random seed for rapid bootstrapping
N=100, # number of bootstrap replicates
file="alignment", # name of output files
exec="raxmlHPC-PTHREADS-SSE3", # name of executable
threads=20
)

@benjjneb
Copy link
Owner

benjjneb commented Mar 9, 2017

We should consider revisiting our recommendations here. It is definitely not uncommon for datasets to be too large for phangorn to work in reasonable amounts of time.

@joey711
Copy link
Collaborator

joey711 commented Mar 9, 2017

I agree. I usually assume with modern datasets that I need to use something like raxml.

@giriarteS
Copy link

I forgot to mention that I used the raxml funtion in the the ips package(v 0.0.7).

@josemseoane
Copy link

Hello,

Thanks a lot for the workflow, is really great and I have learned a lot going through it!
Triying to deal with a lot of files I swithed to raxml in the ips package as suggested by giriarteS and everything went fine except that RAxML is reporting a weird error about my taxon lenght name and reported error "status 65535". Here is wha I did, I would greatly appreciate any help you may provide!

Thanks a lot!

Jose

exec<- "C:/Users/R100295/Documents/Bioinfo/RAxML/raxmlHPC-PTHREADS-SSE3.exe"
alignment.rax.gtr <- raxml(alignment,

  •                        m="GTRGAMMAIX", # model
    
  •                        f="a", # best tree and bootstrap
    
  •                        p=1234, # random number seed
    
  •                        x=2345, # random seed for rapid bootstrapping
    
  •                        N=100, # number of bootstrap replicates
    
  •                        file="alignment", # name of output files
    
  •                        exec= exec # name of executable
    
  •                        #threads=8
    
  • )
    [1] "C:/Users/R100295/Documents/Bioinfo/RAxML/raxmlHPC-PTHREADS-SSE3.exe -f a -p 1234 -x 2345 -m GTRGAMMAIX -N 100 -s alignment.phy -n alignment"

WARNING: The number of threads is currently set to 0
You can specify the number of threads to run via -T numberOfThreads
NumberOfThreads must be set to an integer value greater than 1

RAxML, will now set the number of threads automatically to 2 !

Taxon Name too long at taxon 1, adapt constant nmlngth in
axml.h, current setting 256
Error in file(file, "r") : cannot open the connection
In addition: Warning messages:
1: running command 'C:/Users/R100295/Documents/Bioinfo/RAxML/raxmlHPC-PTHREADS-SSE3.exe -f a -p 1234 -x 2345 -m GTRGAMMAIX -N 100 -s alignment.phy -n alignment' had status 65535
2: In file(file, "r") :
cannot open file 'RAxML_info.alignment': No such file or directory

@giriarteS
Copy link

If you need longer taxon names you can adapt the constant #define nmlngth 256 in file axml.h appropriately. Check the lenght of your SV ID (sequence) and change that parameter in the axml.h file. I just gave a short name to each single variant (SV1.....SVn), I work with ITS and my sequence lengths are really variable. Below I describe my workflow:
seqtab.nochim1 <- seqtab.nochim
colnames(seqtab.nochim1) <- paste0("SV", 1:ncol(seqtab.nochim1))
Then added taxonomy to the original seqtab.nochim
taxtab <- assignTaxonomy(seqtab.nochim, ref_fasta_UNITE, minBoot = 50, tryRC = TRUE,
outputBootstraps = FALSE, taxLevels = c("Kingdom", "Phylum", "Class",
"Order", "Family", "Genus", "Species"), multithread = TRUE,
verbose = TRUE)
and change the names again (the order was the same as in seqtab.nochim)
taxtab1 <- taxtab
rownames(taxtab1) <- paste0("SV", 1:nrow(taxtab1))

The SV table and SV sequences were exported also:
export_taxa_table_and_seqs = function(seqtab.nochim, file_seqtab, file_seqs) {
seqtab.t = as.data.frame(t(seqtab.nochim))
seqs = row.names(seqtab.t)
row.names(seqtab.t) = paste0("SV", 1:nrow(seqtab.t))
outlist = list(data_loaded = seqtab.t)
mctoolsr::export_taxa_table(outlist, file_seqtab)
seqs = as.list(seqs)
seqinr::write.fasta(seqs, row.names(seqtab.t), file_seqs)
}
export_taxa_table_and_seqs(seqtab.nochim,"/path_to your_file/SV_table.txt",
"/path_to your_file/SV_seqs.fa")

Finally you can load your SV fasta file and align the sequences with DECIPHER:
fas <- "SV_seqs.fa"
seqs <- readDNAStringSet(fas)
seqs <- OrientNucleotides(seqs) ## I found duplicated SVs after I reoriented the sequences
alignment <- AlignSeqs(seqs, anchor=NA)

I exported my alignment to a server to run RAXML, it took 9 days for 5000 sequences.
phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
write.phyDat(phang.align, file="alignment.fasta", format="fasta")

TS3alignment <- read.dna("alignment.fasta",format="fasta",as.matrix=TRUE)
alignment.rax.gtr <- raxml(myalignment,
m="GTRGAMMAIX", # model
f="a", # best tree and bootstrap
p=1234, # random number seed
x=2345, # random seed for rapid bootstrapping
N=100, # number of bootstrap replicates
file="alignment", # name of output files
exec="raxmlHPC-PTHREADS-SSE3", # name of executable
threads=20
)

Then you incorporate everything to your phyloseq object:
tree <- read_tree("RAxML_bipartitionsBranchLabels.alignment")
metadata_path <- "Metadata.csv"
samdf <- read.csv(metadata_path, header=TRUE)
all(rownames(seqtab.nochim1) %in% samdf$SampleID)
rownames(samdf) <- samdf$SampleID

ps <- phyloseq(tax_table(taxtab1), sample_data(samdf),otu_table(seqtab.nochim1, taxa_are_rows = FALSE), phy_tree(tree))
saveRDS(ps, file="ps.rds")

This workflow works well, if any of the steps seen redundant or if anybody has suggestions on how to improve it, please let me know.

@josemseoane
Copy link

Thanks a lot @giriarteS, that was a smart way of solving the lenght issue, it worked out perfectly well!

@jarrodscott
Copy link

anyone try 'RAxML next generation' for tree construction?

https://github.com/amkozlov/raxml-ng/

@benjjneb
Copy link
Owner

@jjscarpa I haven't tried it, but would also be interested in results from those who have!

@jarrodscott
Copy link

jarrodscott commented Jun 12, 2017 via email

@rachaelantwis
Copy link

Hi all,
I'm having a bit of trouble with using decipher in dada2.
The code I'm using is as follows:

#Make the phylogenetic tree
library(DECIPHER)
seqs <- getSequences(seqtab.nochim)
names(seqs) <- seqs # This propagates to the tip labels of the tree
alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)

library(phangorn)
phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phang.align)
treeNJ <- NJ(dm) # Note, tip order != sequence order
fit = pml(treeNJ, data=phang.align)

#negative edges length changed to 0!

fitGTR <- update(fit, k=4, inv=0.2)
fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
rearrangement = "stochastic", control = pml.control(trace = 0))
tree <- read.tree(fitGTR$tree)

#EXPORT TO PHYLOSEQ AND MERGE WITH SAMPLE DATA (replace "..." with path name of sample data, i.e. metadata)

  #Tidying Up Sample Data
    samples.out <- data.frame(Sample=rownames(seqtab.nochim))
  
  #Import sample dataset
  library(readr)
  Sample_data <- read_csv("/Users/rachaelantwis/Dropbox/Research/Current/Papers/Ponies/Pony_sample_data.csv")
  View(Sample_data)
  
    sample.info<-Sample_data
    head(sample.info)
    
  #Samples as factor
    sample.info$Sample<-as.factor(sample.info$Sample)
    
  #Merge the Sample ID data from dada2 with the metadata  
    sample.merge<-merge(sample.info,samples.out,by="Sample",all.y=TRUE)
    nrow(sample.merge)
    #tail(sample.merge)
    
  #Rownames of sample data are sample IDs
    rownames(sample.merge)<-sample.merge$Sample
    
  #Add an indicator variable for whether a sample is a sample or a negative
    sample.merge$sampletype<-"sample"
    sample.merge$sampletype[grep("BLANK",sample.merge$Sample)]<-"negative"
  
  #Trim names of Taxa Table - label them as sequence variants (SVs)
      #note this strips out the sequence data whcih dada2 uses as a name element
      #so this is where to find the sequence if you need them
    taxa.slim<-unname(tt.plus)
    rownames(taxa.slim)<- paste("SV",seq(nrow(tt.plus)))  
    head(taxa.slim)
  
  #Trim Names of OTU Table
    seqtab.slim<-t(seqtab.nochim)
    rownames(seqtab.slim)<-paste("SV",seq(nrow(tt.plus)))
      
  #Export to Phyloseq
    library(phyloseq)
    
    ps <- phyloseq(otu_table(seqtab.slim,taxa_are_rows = TRUE), 
                   sample_data(sample.merge), 
                   tax_table((taxa.slim)), phy_tree(tree))

And I get the following error:
Error in validObject(.Object) : invalid class “phyloseq” object:
Component taxa/OTU names do not match.
Taxa indices are critical to analysis.
Try taxa_names()

If I remove the tree component this isn't a problem. Does anyone know what's going wrong?
Many thanks

@benjjneb
Copy link
Owner

I think if you remove the read.tree call, that this will work. You should be able to use fitGTR$tree directly as in the F1000 workflow.

read.tree is intended for reading trees into R from Newick-format files.

@rachaelantwis
Copy link

rachaelantwis commented Feb 22, 2018

Thanks - I still get the error message:

Error in validObject(.Object) : invalid class “phyloseq” object:
Component taxa/OTU names do not match.
Taxa indices are critical to analysis.
Try taxa_names()

When I try:

taxa_names(fitGTR)
NULL

This is what I get when I inspect fitGTR:

fitGTR

loglikelihood: -140407

unconstrained loglikelihood: -1712.417
Proportion of invariant sites: 0.1513297
Discrete gamma model
Number of rate categories: 4
Shape parameter: 0.4878634

Rate matrix:
a c g t
a 0.0000000 0.7836721 3.3499360 1.740702
c 0.7836721 0.0000000 0.5706275 4.415296
g 3.3499360 0.5706275 0.0000000 1.000000
t 1.7407018 4.4152960 1.0000000 0.000000

Base frequencies:
0.2315264 0.2043853 0.2952618 0.2688266

I'm assuming taxa_names should be populated in some way?
Many thanks

@benjjneb
Copy link
Owner

benjjneb commented Feb 22, 2018

Oh... the problem I think is that you have renamed the taxa as "SV" etc. in seqtab and tax, but not in the tree.

I recommend leaving the sequences as the names until you get everything into the phyloseq object, then use taxa_names(ps) <- new.names to rename the phyloseq object (and each slot in it) consistently.

@rachaelantwis
Copy link

rachaelantwis commented Feb 22, 2018

Ah thanks, this is working now! Code as follows (for the next person):

#library(DECIPHER)
seqs <- getSequences(seqtab.nochim)
names(seqs) <- seqs # This propagates to the tip labels of the tree
alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)

#library(phangorn)
phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phang.align)
treeNJ <- NJ(dm) # Note, tip order != sequence order
fit = pml(treeNJ, data=phang.align)

#negative edges length changed to 0!
fitGTR <- update(fit, k=4, inv=0.2)
fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
rearrangement = "stochastic", control = pml.control(trace = 0))

  #Tidying Up Sample Data
    samples.out <- data.frame(Sample=rownames(seqtab.nochim))
  
  #Import sample dataset
  library(readr)
  Sample_data <- read_csv("/Users/rachaelantwis/Dropbox/Research/Current/Papers/Ponies/Pony_sample_data.csv")
  View(Sample_data)
  
    sample.info<-Sample_data
    head(sample.info)
    
  #Samples as factor
    sample.info$Sample<-as.factor(sample.info$Sample)

  #Merge the Sample ID data from dada2 with the metadata  
    sample.merge<-merge(sample.info,samples.out,by="Sample",all.y=TRUE)
    nrow(sample.merge)
    #tail(sample.merge)
    
  #Rownames of sample data are sample IDs
    rownames(sample.merge)<-sample.merge$Sample
    
  #Add an indicator variable for whether a sample is a sample or a negative
    sample.merge$sampletype<-"sample"
    sample.merge$sampletype[grep("BLANK",sample.merge$Sample)]<-"negative"
  
  #Export to Phyloseq
    library(phyloseq)

ps <- phyloseq(tax_table(tt.plus), sample_data(sample.merge), otu_table(seqtab.nochim, taxa_are_rows = FALSE), phy_tree(fitGTR$tree))

    taxa_names(ps) <- paste("SV",seq(nrow(tt.plus)))`

@ndanckert
Copy link

ndanckert commented Feb 27, 2018

Hi,

I am also having trouble with this section of the DADA2 script and am relatively new at this work. I am processing my 16s sequences through the DICIPHER / Phangorn pipeline from your published workflow.

#My script runs as follows:
Seqs <- getSequences(Seqtab)
names(Seqs) <- Seqs
alignment <- AlignSeqs(DNAStringSet(Seqs), anchor=NA)

phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phang.align)
tree.NJ <- NJ(dm)
fit = pml(tree.NJ, data=phang.align)

fitGTR <- update(fit, k=4, inv=0.2)
fitGTR <-optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE, rearrangement = "stochastic", control = pml.control(trace = 0)) 
detach("package:phangorn", unload=TRUE)

After the 'fitGTR <-optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE, rearrangement = "stochastic", control = pml.control(trace = 0))' command I receive the following error:

"Error in if (((ll1 - ll)/ll < control$eps) && rounds > 2) opti <- FALSE : 
  missing value where TRUE/FALSE needed
In addition: There were 50 or more warnings (use warnings() to see the first 50) 
#The error messages are:
1: In optimize(f = fn, interval = c(0.1, 500), lower = 0.1,  ... :
  NA/Inf replaced by maximum positive value"

Could someone please point me in the right direction to solve this.

Much appreciated.
Nathan

@benjjneb
Copy link
Owner

Hm....

There were no errors or warnings in the alignment or NJ tree code before this?

I'm not sure on this one, this is an error in phangorn code so I'm not an expert here, but I would take a look if you can share the fitGTR object causing the problem, i.e. saveRDS(fitGTR, "fitGTR.rds") right before the error-causing command, and post that here.

@ndanckert
Copy link

ndanckert commented Feb 28, 2018 via email

@benjjneb
Copy link
Owner

@ndanckert

I don't see it attached, might be too big. You can email me:

benjamin DOT j DOT callahan AT gmail DOT com

@benjjneb
Copy link
Owner

benjjneb commented Mar 3, 2018

@ndanckert

I was able to complete the command that is giving you trouble on my machine. That suggests this might be something solved by updating your libraries or R version. See my setup below. Are you using older versions of phangorn, ape or R?

sessionInfo()
R version 3.4.3 (2017-11-30)

Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] phangorn_2.3.1 ape_5.0

loaded via a namespace (and not attached):
[1] compiler_3.4.3 Matrix_1.2-12 magrittr_1.5 parallel_3.4.3 tools_3.4.3 igraph_1.1.2 yaml_2.1.15
[8] fastmatch_1.1-0 Rcpp_0.12.15 nlme_3.1-131 grid_3.4.3 pkgconfig_2.0.1 lattice_0.20-35 quadprog_1.5-5

@ndanckert
Copy link

ndanckert commented Mar 6, 2018 via email

@akalichen
Copy link

Hello, I have also run into a similar problem with this revised workflow "http://web.stanford.edu/class/bios221/MicrobiomeWorkflowII.html".
when I tried to execute this commandline.

phangAlign <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phangAlign)
treeNJ <- NJ(dm) # Note, tip order != sequence order

fit = pml(treeNJ, data=phangAlign)
fitGTR <- update(fit, k=4, inv=0.2)
fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
        rearrangement = "stochastic", control = pml.control(trace = 0))
detach("package:phangorn", unload=TRUE)

I get error message like this.

"NA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive value"

@akalichen
Copy link

my Rstudio and my packages were uptodate.

@benjjneb
Copy link
Owner

@akalichen Can you repost this issue at the GitHub issues page for the workflow itself? https://github.com/spholmes/F1000_workflow/issues

This error isn't really related to the dada2 package.

@akalichen
Copy link

Thanks benjjneb! I will do that right now.

@zhilongjia
Copy link

@jarrodscott how about the performance and speed of RaxML and RaxML-ng? Thank you.

@ricmedveterinario
Copy link

Hello,

I have the workflow after dada2:

#multiple-alignment
library("DECIPHER", lib.loc="~/R/win-library/3.4")
setwd ("C:/Users/AVELL/Desktop/2_fase")

seqs <- read.table(file = "seqs_2 _teste.txt")
seqs <- as.matrix(seqs)
names(seqs) <- seqs
alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)

#Construct the phylogenetic tree
library("phangorn")

phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phang.align)
treeNJ <- NJ(dm)
treeNJ# Note, tip order != sequence order
fit = pml(treeNJ, data=phang.align)

But I have problems with the following error in FIT:

fit = pml(treeNJ, data=phang.align)

negative edges length changed to 0!
Error in pml(treeNJ, data = phang.align) : tip labels are not in data

Could someone please point me in the right direction to solve this.

Much appreciated.
Richard

@benjjneb
Copy link
Owner

seqs <- read.table(file = "seqs_2 _teste.txt")
seqs <- as.matrix(seqs)
names(seqs) <- seqs

This workflow is expecting seqs to be a character vector of the DNA sequences, not a matrix. You have read in a matrix it seems? You need to get seqs into the right format, and then it should work as expected.

@Aircus
Copy link

Aircus commented May 24, 2019

Thanks Ben, I've double checked my computer at uni and you're correct. The error was due to the software not being up to date. Thank you for the help.

On Sun, Mar 4, 2018 at 12:41 AM, Benjamin Callahan @.*** > wrote: @ndanckert https://github.com/ndanckert I was able to complete the command that is giving you trouble on my machine. That suggests this might be something solved by updating your libraries or R version. See my setup below. Are you using older versions of phangorn, ape or R? sessionInfo() R version 3.4.3 (2017-11-30) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Sierra 10.12.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/ A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.4/ Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] phangorn_2.3.1 ape_5.0 loaded via a namespace (and not attached): [1] compiler_3.4.3 Matrix_1.2-12 magrittr_1.5 parallel_3.4.3 tools_3.4.3 igraph_1.1.2 yaml_2.1.15 [8] fastmatch_1.1-0 Rcpp_0.12.15 nlme_3.1-131 grid_3.4.3 pkgconfig_2.0.1 lattice_0.20-35 quadprog_1.5-5 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#88 (comment)>, or mute the thread https://github.com/notifications/unsubscribe-auth/AjMVCaa_-S64jN_Xt8gn2AZbXoR_8ki8ks5tap2agaJpZM4JA42u .

hi!
How didnt you solve this problem? I have the same error now

Best regards!
Tong

@khuch123
Copy link

khuch123 commented Sep 6, 2019

@giriarteS
The workflow you mentioned produced these errors.
Any help please?

@ChrisTrivedi
Copy link

If you need longer taxon names you can adapt the constant #define nmlngth 256 in file axml.h appropriately. Check the lenght of your SV ID (sequence) and change that parameter in the axml.h file. I just gave a short name to each single variant (SV1.....SVn), I work with ITS and my sequence lengths are really variable. Below I describe my workflow:
seqtab.nochim1 <- seqtab.nochim
colnames(seqtab.nochim1) <- paste0("SV", 1:ncol(seqtab.nochim1))
Then added taxonomy to the original seqtab.nochim
taxtab <- assignTaxonomy(seqtab.nochim, ref_fasta_UNITE, minBoot = 50, tryRC = TRUE,
outputBootstraps = FALSE, taxLevels = c("Kingdom", "Phylum", "Class",
"Order", "Family", "Genus", "Species"), multithread = TRUE,
verbose = TRUE)
and change the names again (the order was the same as in seqtab.nochim)
taxtab1 <- taxtab
rownames(taxtab1) <- paste0("SV", 1:nrow(taxtab1))

The SV table and SV sequences were exported also:
export_taxa_table_and_seqs = function(seqtab.nochim, file_seqtab, file_seqs) {
seqtab.t = as.data.frame(t(seqtab.nochim))
seqs = row.names(seqtab.t)
row.names(seqtab.t) = paste0("SV", 1:nrow(seqtab.t))
outlist = list(data_loaded = seqtab.t)
mctoolsr::export_taxa_table(outlist, file_seqtab)
seqs = as.list(seqs)
seqinr::write.fasta(seqs, row.names(seqtab.t), file_seqs)
}
export_taxa_table_and_seqs(seqtab.nochim,"/path_to your_file/SV_table.txt",
"/path_to your_file/SV_seqs.fa")

Finally you can load your SV fasta file and align the sequences with DECIPHER:
fas <- "SV_seqs.fa"
seqs <- readDNAStringSet(fas)
seqs <- OrientNucleotides(seqs) ## I found duplicated SVs after I reoriented the sequences
alignment <- AlignSeqs(seqs, anchor=NA)

I exported my alignment to a server to run RAXML, it took 9 days for 5000 sequences.
phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
write.phyDat(phang.align, file="alignment.fasta", format="fasta")

TS3alignment <- read.dna("alignment.fasta",format="fasta",as.matrix=TRUE)
alignment.rax.gtr <- raxml(myalignment,
m="GTRGAMMAIX", # model
f="a", # best tree and bootstrap
p=1234, # random number seed
x=2345, # random seed for rapid bootstrapping
N=100, # number of bootstrap replicates
file="alignment", # name of output files
exec="raxmlHPC-PTHREADS-SSE3", # name of executable
threads=20
)

Then you incorporate everything to your phyloseq object:
tree <- read_tree("RAxML_bipartitionsBranchLabels.alignment")
metadata_path <- "Metadata.csv"
samdf <- read.csv(metadata_path, header=TRUE)
all(rownames(seqtab.nochim1) %in% samdf$SampleID)
rownames(samdf) <- samdf$SampleID

ps <- phyloseq(tax_table(taxtab1), sample_data(samdf),otu_table(seqtab.nochim1, taxa_are_rows = FALSE), phy_tree(tree))
saveRDS(ps, file="ps.rds")

This workflow works well, if any of the steps seen redundant or if anybody has suggestions on how to improve it, please let me know.

Hi all, apologies for posting on a closed topic, but I'm having a few issues with incorporating @giriarteS 's workflow. I'm using it for 16S/18S/ITS which have >5000 seqs to align. The one thing I don't want to do (especially for 16/18S) is rename my sequences to SV1...

I tried to skip those portions of the workflow by using the normal seqtabl.nochim and assign taxonomy instructions by B. Callahan - then going directly into the DECIPHER alignment:

tree.sequences <- getSequences(seqtab.nochim)
names(tree.sequences) <- tree.sequences  # this propagates to the tip labels of the tree
tree.alignment <- AlignSeqs(DNAStringSet(tree.sequences), anchor=NA)
phang_align <- phyDat(as(tree.alignment, 'matrix'), type='DNA')
write.phyDat(phang_align, file="alignment.fasta", format="fasta")
16S_alignment <- read.dna("alignment2.fasta",format="fasta",as.matrix=TRUE)

but I get an error from the Ape package (I'm assuming), which says Error: unexpected symbol in "16S_alignment" which I can't seem to track.

Additionally, I tried to skip phangorn altogether by exporting the alignment from DECIPHER and reimporting via Ape:

tree.sequences <- getSequences(seqtab.nochim)
names(tree.sequences) <- tree.sequences  # this propagates to the tip labels of the tree
tree.alignment <- AlignSeqs(DNAStringSet(tree.sequences), anchor=NA)
writeXStringSet(tree.alignment, file="alignment2.fasta")
16S_alignment <- read.dna("alignment2.fasta",format="fasta",as.matrix=TRUE)

But I was also unsuccessful. Any help would be appreciated. Not quite sure what I'm doing wrong, and maybe this is a better question for a DECIPHER/Ape thread.

@ZuaniaColon
Copy link

Hi, is there a way to export and read the tree "alignment.rax.gtr"? Because of time constrains, I would like to run two different codes. I saw that you read it with

tree <- read_tree("RAxML_bipartitionsBranchLabels.alignment"),

but I cannot follow when and how you saved it.

Thanks,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests