Support extended Newick? #193

Closed
cossio opened this issue Jan 21, 2023 · 7 comments

@cossio

cossio commented Jan 21, 2023

It seems the package cannot read extended Newick format trees?

I'm having trouble in particular reading trees stored in RFAM (https://docs.rfam.org/en/latest/api.html?highlight=nhx#tree-data). For example,

http://rfam.org/family/RF00360/tree/

@crsl4
Member

crsl4 commented Jan 21, 2023

Hello! The package can read extended Newick, but there are many different flavors, so maybe there are subtleties involved. I tried to open the example link, but couldn't. The Rfam documentation calls this NHX format, which I am not familiar with. The extended Newick format that we can handle is explained in the package wiki.
If you can provide an example of the NHX format, or a description, maybe we can figure out if we can read it in PhyloNetworks, and if not, maybe create a function that does. Thanks!

@cecileane
Member

To clarify, "extended" newick could mean different things to different people.

  • For some, the extended format allows for reticulations, to code networks, not just trees. I think that's what @crsl4 meant. And this type of extended newick is definitely parsed by readTopology.
  • For others, the extended format allows for extra node and edge information within the parenthetical newick description of a tree (like credibility interval of the age of a node, rate and its credibility interval for an edge, etc.). That's perhaps what you meant, @cossio . There are many such extensions, with different (conflicting?) ways to add extra node & edge information. It makes it hard (impossible?) to parse all forms of this type of extension. The function readTopology can parse some forms, those using nexus-style comments, like: [&...]. This format is used by BEAST for example.
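As a rough illustration of the second bullet, the check below flags square-bracket blocks that do not open with `[&`. It is only a sketch using Julia's built-in regex support: the helper name `non_comment_brackets` and the sample strings are made up for this example, and it does not reproduce how readTopology actually handles comments.

```julia
# Hypothetical helper (not part of PhyloNetworks): list every [...] block
# that does not start with "[&", i.e. is not a nexus-style comment.
function non_comment_brackets(newick::AbstractString)
    # \[(?!&)   : an opening bracket not followed by '&'
    # [^\]]*\]  : everything up to and including the closing bracket
    return [m.match for m in eachmatch(r"\[(?!&)[^\]]*\]", newick)]
end

beast_like = "(A[&rate=0.1]:1.0,B[&rate=0.2]:2.0);"        # made-up example
rfam_like  = "(A_{rice}[4530].2:0.1,B_{t..[3702].1:0.2);"  # shortened, made-up labels

non_comment_brackets(beast_like)  # empty: all brackets open with "[&"
non_comment_brackets(rfam_like)   # "[4530]" and "[3702]"
```

A non-empty result suggests the string needs some cleaning before it is handed to the parser.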

@cecileane
Member

@crsl4 : I was able to open the link, copy-pasted below.

(134.00_AJ307928.1/3-121_Oryza_sativa_{rice}[4530].2:0.00053,132.80_AJ307662.1/4013-4128_Oryza_sativa_{rice}[4530].3:0.00857,(((136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G..[39947].2:0.00055,(134.50_AY013245.2/61987-62105_Oryza_sativa_Japonica_G..[39947].1:0.00836,135.10_AJ489952.1/1-119_Oryza_sativa_{rice}[4530].1:0.00055)1.000:0.05288)0.380:0.00702,(105.20_AC135465.23/41412-41522_Medicago_truncatula_{ba..[3880].1:0.27549,(_AJ298135.1/1-115_Arabidopsis_thaliana_{t..[3702].1:0.15425,74.50_AF318011.1/1-106_Arabidopsis_thaliana_{t..[3702].2:0.32379)0.010:0.07279)0.960:0.16629)0.740:0.00990,73.00_AJ489954.1/1-104_Oryza_sativa_{rice}[4530].4:0.29332)0.790:0.01696);

@cossio
Author

cossio commented Jan 21, 2023

Dear @cecileane, thanks for your comments, they help clarify the situation. Since you were able to open the Rfam example, can you confirm whether you were able to parse it? For me, it gives an error.

Other than that, I can't say at the moment what flavor of Newick Rfam is using (this "NHX"). I tried contacting them and will post here if I get a reply.

@cecileane
Member

Yes I confirm. The error I get gives this message:

ERROR: read '[' but not followed by &

The parser expects a nexus-style comment after an opening square bracket, and these comments should start with [&.

This means that things like [4530] prevent the tree from being parsed. We can remove them before parsing, using the regular expression \[\d+\] to search for the pattern [...digits...], either in a text editor that supports regex search & replace, or with sed to do this quickly on one or multiple files. In the example above, there were 9 instances of this pattern. After removing them, readTopology was able to parse the tree. The taxon labels seem weird though:

julia> net = readTopology("(134.00_AJ307928.1/3-121_Oryza_sativa_{rice}.2:0.00053,132.80_AJ307662.1/4013-4128_Oryza_sativa_{rice}.3:0.00857,(((136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G...2:0.00055,(134.50_AY013245.2/61987-62105_Oryza_sativa_Japonica_G...1:0.00836,135.10_AJ489952.1/1-119_Oryza_sativa_{rice}.1:0.00055)1.000:0.05288)0.380:0.00702,(105.20_AC135465.23/41412-41522_Medicago_truncatula_{ba...1:0.27549,(_AJ298135.1/1-115_Arabidopsis_thaliana_{t...1:0.15425,74.50_AF318011.1/1-106_Arabidopsis_thaliana_{t...2:0.32379)0.010:0.07279)0.960:0.16629)0.740:0.00990,73.00_AJ489954.1/1-104_Oryza_sativa_{rice}.4:0.29332)0.790:0.01696);");

julia> tipLabels(net)
9-element Vector{String}:
 "134.00_AJ307928.1/3-121_Oryza_sativa_{rice}.2"
 "132.80_AJ307662.1/4013-4128_Oryza_sativa_{rice}.3"
 "136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G...2"
 "134.50_AY013245.2/61987-62105_Oryza_sativa_Japonica_G...1"
 "135.10_AJ489952.1/1-119_Oryza_sativa_{rice}.1"
 "105.20_AC135465.23/41412-41522_Medicago_truncatula_{ba...1"
 "_AJ298135.1/1-115_Arabidopsis_thaliana_{t...1"
 "74.50_AF318011.1/1-106_Arabidopsis_thaliana_{t...2"
 "73.00_AJ489954.1/1-104_Oryza_sativa_{rice}.4"

The things I removed, like [4530], seem to have been part of the taxon names. The numbers could be included in the taxon names by simply removing the square brackets, and leaving in the digits in between them. Depending on what's needed for the taxon names, a different regular expression should be used.
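The two options described here differ only in the replacement pattern. A small sketch in Julia on a single label taken from the tree above (the variable names are just for illustration): a capture group keeps the digits while dropping the brackets.

```julia
label = "135.10_AJ489952.1/1-119_Oryza_sativa_{rice}[4530].1"

# Option 1: drop the brackets and the digits inside them
dropped = replace(label, r"\[\d+\]" => "")

# Option 2: drop only the brackets, keeping the digits via a capture group
kept = replace(label, r"\[(\d+)\]" => s"\1")

println(dropped)  # 135.10_AJ489952.1/1-119_Oryza_sativa_{rice}.1
println(kept)     # 135.10_AJ489952.1/1-119_Oryza_sativa_{rice}4530.1
```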

@cecileane
Member

PhyloNetworks v0.16.1 now has a better parser for trees & networks written by some of the BEAST2 packages. Unfortunately, that format conflicts with the format you had. So I am going to provide a programmatic solution to the error you encountered, and close the issue.

Here is the newick format you had:

rfam_nwk = "(134.00_AJ307928.1/3-121_Oryza_sativa_{rice}[4530].2:0.00053,132.80_AJ307662.1/4013-4128_Oryza_sativa_{rice}[4530].3:0.00857,(((136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G..[39947].2:0.00055,(134.50_AY013245.2/61987-62105_Oryza_sativa_Japonica_G..[39947].1:0.00836,135.10_AJ489952.1/1-119_Oryza_sativa_{rice}[4530].1:0.00055)1.000:0.05288)0.380:0.00702,(105.20_AC135465.23/41412-41522_Medicago_truncatula_{ba..[3880].1:0.27549,(_AJ298135.1/1-115_Arabidopsis_thaliana_{t..[3702].1:0.15425,74.50_AF318011.1/1-106_Arabidopsis_thaliana_{t..[3702].2:0.32379)0.010:0.07279)0.960:0.16629)0.740:0.00990,73.00_AJ489954.1/1-104_Oryza_sativa_{rice}[4530].4:0.29332)0.790:0.01696);";

Here is a way to remove the brackets and the numbers inside, and the dots preceding the opening brackets, using regular expressions:

nwk1 = replace(rfam_nwk, r"\.*\[\d+\]" => "")
"(134.00_AJ307928.1/3-121_Oryza_sativa_{rice}.2:0.00053,132.80_AJ307662.1/4013-4128_Oryza_sativa_{rice}.3:0.00857,(((136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G.2:0.00055,(134.50_AY013245.2/61987-62105_Oryza_sativa_Japonica_G.1:0.00836,135.10_AJ489952.1/1-119_Oryza_sativa_{rice}.1:0.00055)1.000:0.05288)0.380:0.00702,(105.20_AC135465.23/41412-41522_Medicago_truncatula_{ba.1:0.27549,(_AJ298135.1/1-115_Arabidopsis_thaliana_{t.1:0.15425,74.50_AF318011.1/1-106_Arabidopsis_thaliana_{t.2:0.32379)0.010:0.07279)0.960:0.16629)0.740:0.00990,73.00_AJ489954.1/1-104_Oryza_sativa_{rice}.4:0.29332)0.790:0.01696);"

Alternatively, the expression below would replace each bracket with an underscore, keeping the digits inside and anything preceding the brackets:

nwk2 = replace(rfam_nwk, r"[\[\]]" => "_")
"(134.00_AJ307928.1/3-121_Oryza_sativa_{rice}_4530_.2:0.00053,132.80_AJ307662.1/4013-4128_Oryza_sativa_{rice}_4530_.3:0.00857,(((136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G.._39947_.2:0.00055,(134.50_AY013245.2/61987-62105_Oryza_sativa_Japonica_G.._39947_.1:0.00836,135.10_AJ489952.1/1-119_Oryza_sativa_{rice}_4530_.1:0.00055)1.000:0.05288)0.380:0.00702,(105.20_AC135465.23/41412-41522_Medicago_truncatula_{ba.._3880_.1:0.27549,(_AJ298135.1/1-115_Arabidopsis_thaliana_{t.._3702_.1:0.15425,74.50_AF318011.1/1-106_Arabidopsis_thaliana_{t.._3702_.2:0.32379)0.010:0.07279)0.960:0.16629)0.740:0.00990,73.00_AJ489954.1/1-104_Oryza_sativa_{rice}_4530_.4:0.29332)0.790:0.01696);"

Either form can be parsed successfully. Only the taxon names will differ:

net1 = readTopology(nwk1);
tipLabels(net1)[3] # third taxon name
"136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G.2"

net2 = readTopology(nwk2);
tipLabels(net2)[3]
"136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G.._39947_.2"

Now, what if this newick format is used in many trees within a file?
If it's a nexus file, we can use the new function readnexus_treeblock, which accepts an option stringmodifier to modify the newick string before it's parsed. Just like we did above. To show an example, I'll first create a nexus-formatted file that has the rfam-newick string as the single tree in a tree block:

write("rfam.nex", """
#nexus
begin trees;
tree gt = $rfam_nwk
end;
""")

With v0.16.1, we can read this file successfully, if we pass it the desired string modifier:

treelist = readnexus_treeblock("rfam.nex", stringmodifier = [r"\.*\[\d+\]" => ""]); # same as in net1
length(treelist) # 1
tipLabels(treelist[1])[3] # third taxon name in 1st (and only) tree in the list
"136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G.2"
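If the trees are instead stored in a plain text file with one Newick string per line (an assumed layout, not something Rfam or PhyloNetworks prescribes), the same substitution could be applied line by line before parsing. The sketch below only does the string cleaning, with a made-up file name and shortened made-up trees; each cleaned string could then be passed to readTopology.

```julia
# Write a small example file: one newick string per line (hypothetical content).
path = joinpath(mktempdir(), "rfam_trees.nwk")
write(path, """
(a_{rice}[4530].1:0.1,b_G..[39947].2:0.2);
(c_{t..[3702].1:0.3,d_{ba..[3880].1:0.4);
""")

# Same pattern as above: drop "[digits]" and any dots right before the bracket.
pattern = r"\.*\[\d+\]" => ""
cleaned = [replace(line, pattern) for line in eachline(path) if !isempty(strip(line))]

foreach(println, cleaned)
# (a_{rice}.1:0.1,b_G.2:0.2);
# (c_{t.1:0.3,d_{ba.1:0.4);
```

Skipping blank lines keeps a trailing newline in the file from producing an empty string in the list.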

@cossio
Author

cossio commented Apr 11, 2023

Thanks!
