validation of Newick output #16

Open
petermr opened this issue Aug 7, 2015 · 7 comments

@petermr
Member

petermr commented Aug 7, 2015

Provide a mechanism for validating *.nwk output.

As an example, http://libpll.org/api/group__newickParseGroup.html defines a valid tree as:

> Validate if a newick tree is a valid phylogenetic tree.
>
> A valid tree is one where the root node is binary or ternary and all other internal nodes are binary. In case the root is ternary then the tree must contain at least another internal node and the total number of nodes must be equal to $2l - 2$, where $l$ is the number of leaves. If the root is binary, then the total number of nodes must be equal to $2l - 1$.

This implies that all multiple parentage should be expanded to binary trees apart from roots.

Is this a satisfactory validator? And does it validate node labels, etc.?
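For reference, the rule quoted above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not libpll's implementation: the names `parse_newick` and `is_libpll_style_valid` are hypothetical, and the parser is deliberately simplified (it assumes no quoted labels or comments, and it discards internal-node labels and branch lengths).

```python
def parse_newick(text):
    """Parse a Newick string into nested lists; leaves become strings.

    Simplified on purpose: no quoted labels, no comments; internal-node
    labels and branch lengths are read and then discarded.
    """
    s = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume "("
            children.append(node())
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ")"
        start = pos
        while pos < len(s) and s[pos] not in "(),":
            pos += 1
        label = s[start:pos].split(":")[0].strip()
        return children if children else label

    tree = node()
    if pos != len(s):
        raise ValueError("unexpected text at position %d" % pos)
    return tree


def is_libpll_style_valid(text):
    """Apply the quoted rule: binary or ternary root, binary elsewhere."""
    tree = parse_newick(text)
    if not isinstance(tree, list):
        return False  # a lone leaf has no root split
    leaves = 0
    nodes = 0
    stack = [(tree, True)]
    while stack:
        current, is_root = stack.pop()
        nodes += 1
        if isinstance(current, list):
            allowed = (2, 3) if is_root else (2,)
            if len(current) not in allowed:
                return False
            stack.extend((child, False) for child in current)
        else:
            leaves += 1
    if len(tree) == 3:
        # Ternary root: needs another internal node and 2l - 2 nodes.
        has_inner = any(isinstance(child, list) for child in tree)
        return has_inner and nodes == 2 * leaves - 2
    return nodes == 2 * leaves - 1  # binary root
```

Under this sketch, `((A,B),(C,D));` and `(A,B,(C,D));` pass, while any tree with a non-root polytomy, such as `(A,(B,C,D));`, is rejected. Node labels are not checked at all.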

@rossmounce
Member

If that is implying that all trees have to be binary then no, that is not correct.

It is permitted in Newick to have a trifurcation, e.g. (A,(B,C,D)), or a larger polytomy, e.g. (A,(B,(C,D,E,F,G))).
There are unfortunately many slightly different ways of writing Newick.
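A quick way to see whether a given Newick string contains a polytomy is to count the children of each parenthesised group. The sketch below is a rough illustration only: `child_counts` and `has_polytomy` are hypothetical names, and the scan assumes unquoted labels containing no parentheses or commas. Note that it also reports a ternary root as a polytomy, which is conventional for unrooted trees.

```python
def child_counts(newick):
    """Return the child count of each internal node, in closing order."""
    counts = []
    stack = []
    for ch in newick:
        if ch == "(":
            stack.append(1)  # a new group starts with one child slot
        elif ch == ",":
            if not stack:
                raise ValueError("comma outside parentheses")
            stack[-1] += 1  # each comma adds one more sibling
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced parentheses")
            counts.append(stack.pop())
    if stack:
        raise ValueError("unbalanced parentheses")
    return counts


def has_polytomy(newick):
    """True if any internal node has more than two children."""
    return any(n > 2 for n in child_counts(newick))
```

For example, `child_counts("(A,(B,(C,D,E,F,G)));")` gives `[5, 2, 2]`, so `has_polytomy` returns `True`, whereas a fully binary tree such as `((A,B),(C,D));` gives `[2, 2, 2]`.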

@petermr
Member Author

petermr commented Aug 7, 2015

On Fri, Aug 7, 2015 at 4:49 PM, Ross Mounce notifications@github.com wrote:

> If that is implying that all trees have to be binary then no, that is not correct.

That's what it implies, and it calls itself a Validator.

> It is permitted in Newick to have a trifurcation, e.g. (A,(B,C,D)), or a larger polytomy, e.g. (A,(B,(C,D,E,F,G))).
> There are unfortunately many slightly different ways of writing Newick: https://en.wikipedia.org/wiki/Newick_format.

That's exactly why it's a problem. It may mean that I will have to create a
STK2-specific Newick. In any case the transfer has to be validated.

So the likelihood is that we have a single file of 5000 lines with Newick
in? In which case we will at some stage need a tool to summarize the CTrees
and create one [1].

[1] Yes we can find/grep/cat to concatenate output, but ultimately
summarisation should be done in AMI using some form of map/reduce strategy.


Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

@petermr
Member Author

petermr commented Aug 7, 2015

Does this mean that we can try with a small number of trees to test whether the supertree workflow works (even if the answers are not meaningful)?

@petermr
Member Author

petermr commented Aug 7, 2015

From Ross

> ??? Do you mean input into STK2 from AMI? We just need to concatenate all the *.nwk files into one big *.tre file for STK2, with one nwk per line. No additional re-shaping or reformatting is needed (provided that the taxon names have already been standardised). At most it will entail the subtraction or addition of semicolons at the end of each line.
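The concatenation step described above could look roughly like this in Python. It is a sketch under assumptions: the function names, the directory layout and the output file name are all hypothetical, and whether STK2 wants a trailing semicolon on each line is left as a parameter because that is not settled in this thread.

```python
from pathlib import Path


def normalise_newick_line(text, terminator=";"):
    """Collapse a Newick string onto one line with a single terminator."""
    one_line = " ".join(text.split())  # remove newlines and repeated spaces
    return one_line.rstrip("; ") + terminator


def concatenate_nwk(nwk_dir, out_file, terminator=";"):
    """Write every non-empty *.nwk under nwk_dir to out_file, one per line.

    Returns the number of trees written. Taxon names are assumed to be
    standardised already; nothing else is reshaped.
    """
    lines = []
    for path in sorted(Path(nwk_dir).rglob("*.nwk")):
        text = path.read_text(encoding="utf-8").strip()
        if text:
            lines.append(normalise_newick_line(text, terminator))
    Path(out_file).write_text(
        "".join(line + "\n" for line in lines), encoding="utf-8"
    )
    return len(lines)
```

Passing `terminator=""` strips the semicolons instead of adding them, which covers the "subtraction or addition" case either way.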

We haven't decided where the *.nwk files are in the CTree. Since there could be >1 image, there will be >1 *.nwk.

> I have validated the Newick generated by AMI this morning. I used the command line mode of TreeGraph 2 to generate new images of the trees in .png & .svg from the .nwk files. 2195 / 2211 were successfully interpreted. Sorry I have not reported this sooner. I will put up details about the errors in the errors folder on phylotree ASAP.

So you will flag 16 files as errors in an issue, explain what is wrong and assign them as an issue for me?

> I trust TreeGraph 2 as a validator. Some, like DendroPy (Python), are useful but too strict: they throw a fit at all the unlabelled taxa, so they are not so useful at this stage.

That's your shout. My point is that I have to know that AMI output is valid. It sounds like some of it isn't
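The tally reported above (2195 of 2211 files interpreted, so 16 failures) is the kind of summary a small batch check can produce. The sketch below is not TreeGraph 2 and does not reproduce its behaviour; `looks_like_newick` and `tally` are hypothetical names for a deliberately lenient structural check (balanced parentheses and a single terminating semicolon) that tolerates unlabelled taxa.

```python
def looks_like_newick(text):
    """Lenient check: balanced parentheses and one final semicolon.

    Labels are not inspected, so unlabelled taxa are accepted.
    """
    s = text.strip()
    if len(s) < 2 or not s.endswith(";"):
        return False
    depth = 0
    for ch in s[:-1]:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing parenthesis with no opener
        elif ch == ";":
            return False  # semicolon before the end of the tree
    return depth == 0


def tally(trees):
    """Given {file name: Newick text}, return (passed count, failed names)."""
    failed = sorted(
        name for name, text in trees.items() if not looks_like_newick(text)
    )
    return len(trees) - len(failed), failed
```

The list of failed names is what would need to go into a follow-up issue, together with an explanation of what is wrong with each file.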

@rossmounce
Member

Just uploaded it all to https://github.com/ContentMine/phylotree/tree/master/errors/TreeGraph2-validation-tests

I have now posted a separate issue, #17, for the specific files which appear to be erroneous.

@rossmounce changed the title from "valiadtion of Newick output" to "validation of Newick output" on Aug 7, 2015
@petermr
Member Author

petermr commented Aug 7, 2015

There's an error in GitHub:

Sorry, we had to truncate this directory to 1,000 files. 7,798 entries were
omitted from the list.


@petermr
Member Author

petermr commented Aug 7, 2015

Are these all files in error? (.../errors/TreeGraph2-validation-tests, https://github.com/ContentMine/phylotree/tree/master/errors/TreeGraph2-validation-tests)

We need a description of what these files are. They look like potential
input for tests, not errors.

