Support extended Newick? #193
Hello! The package can read extended Newick, but there are many different flavors, so maybe there are subtleties involved. I tried to open the example link, but couldn't. The Rfam documentation calls this an NHX format, which I am not familiar with. The extended Newick format that we can handle is explained in the package wiki. |
To clarify, "extended" newick could mean different things to different people.
|
@crsl4 : I was able to open the link, copy-pasted below.
|
Dear @cecileane, thanks for your comments; they help clarify the situation. Since you were able to open the Rfam example, can you confirm whether you were able to parse it? For me, it gives an error. Other than that, I can't say at the moment what flavor of Newick Rfam is using (this "NHX"). I tried contacting them and will post here if I get a reply. |
Yes I confirm. The error I get gives this message:
The parser expects a nexus-style comment after an opening square bracket, and these comments should start with It means that things like
julia> net = readTopology("(134.00_AJ307928.1/3-121_Oryza_sativa_{rice}.2:0.00053,132.80_AJ307662.1/4013-4128_Oryza_sativa_{rice}.3:0.00857,(((136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G...2:0.00055,(134.50_AY013245.2/61987-62105_Oryza_sativa_Japonica_G...1:0.00836,135.10_AJ489952.1/1-119_Oryza_sativa_{rice}.1:0.00055)1.000:0.05288)0.380:0.00702,(105.20_AC135465.23/41412-41522_Medicago_truncatula_{ba...1:0.27549,(_AJ298135.1/1-115_Arabidopsis_thaliana_{t...1:0.15425,74.50_AF318011.1/1-106_Arabidopsis_thaliana_{t...2:0.32379)0.010:0.07279)0.960:0.16629)0.740:0.00990,73.00_AJ489954.1/1-104_Oryza_sativa_{rice}.4:0.29332)0.790:0.01696);");
julia> tipLabels(net)
9-element Vector{String}:
"134.00_AJ307928.1/3-121_Oryza_sativa_{rice}.2"
"132.80_AJ307662.1/4013-4128_Oryza_sativa_{rice}.3"
"136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G...2"
"134.50_AY013245.2/61987-62105_Oryza_sativa_Japonica_G...1"
"135.10_AJ489952.1/1-119_Oryza_sativa_{rice}.1"
"105.20_AC135465.23/41412-41522_Medicago_truncatula_{ba...1"
"_AJ298135.1/1-115_Arabidopsis_thaliana_{t...1"
"74.50_AF318011.1/1-106_Arabidopsis_thaliana_{t...2"
"73.00_AJ489954.1/1-104_Oryza_sativa_{rice}.4"
The things I removed, like |
PhyloNetworks v0.16.1 now has a better parser for trees & networks written by some BEAST2 packages. This format conflicts with the format you had, unfortunately. So I am going to provide a programmatic solution to the error you encountered, and close the issue. Here is the newick format you had:
rfam_nwk = "(134.00_AJ307928.1/3-121_Oryza_sativa_{rice}[4530].2:0.00053,132.80_AJ307662.1/4013-4128_Oryza_sativa_{rice}[4530].3:0.00857,(((136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G..[39947].2:0.00055,(134.50_AY013245.2/61987-62105_Oryza_sativa_Japonica_G..[39947].1:0.00836,135.10_AJ489952.1/1-119_Oryza_sativa_{rice}[4530].1:0.00055)1.000:0.05288)0.380:0.00702,(105.20_AC135465.23/41412-41522_Medicago_truncatula_{ba..[3880].1:0.27549,(_AJ298135.1/1-115_Arabidopsis_thaliana_{t..[3702].1:0.15425,74.50_AF318011.1/1-106_Arabidopsis_thaliana_{t..[3702].2:0.32379)0.010:0.07279)0.960:0.16629)0.740:0.00990,73.00_AJ489954.1/1-104_Oryza_sativa_{rice}[4530].4:0.29332)0.790:0.01696);";
Here is a way to remove the brackets and the numbers inside, and the dots preceding the opening brackets, using regular expressions:
nwk1 = replace(rfam_nwk, r"\.*\[\d+\]" => "")
"(134.00_AJ307928.1/3-121_Oryza_sativa_{rice}.2:0.00053,132.80_AJ307662.1/4013-4128_Oryza_sativa_{rice}.3:0.00857,(((136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G.2:0.00055,(134.50_AY013245.2/61987-62105_Oryza_sativa_Japonica_G.1:0.00836,135.10_AJ489952.1/1-119_Oryza_sativa_{rice}.1:0.00055)1.000:0.05288)0.380:0.00702,(105.20_AC135465.23/41412-41522_Medicago_truncatula_{ba.1:0.27549,(_AJ298135.1/1-115_Arabidopsis_thaliana_{t.1:0.15425,74.50_AF318011.1/1-106_Arabidopsis_thaliana_{t.2:0.32379)0.010:0.07279)0.960:0.16629)0.740:0.00990,73.00_AJ489954.1/1-104_Oryza_sativa_{rice}.4:0.29332)0.790:0.01696);"
Alternatively, the expression below replaces each bracket with an underscore, keeping the numbers inside the brackets and the dots preceding them:
nwk2 = replace(rfam_nwk, r"[\[\]]" => "_")
"(134.00_AJ307928.1/3-121_Oryza_sativa_{rice}_4530_.2:0.00053,132.80_AJ307662.1/4013-4128_Oryza_sativa_{rice}_4530_.3:0.00857,(((136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G.._39947_.2:0.00055,(134.50_AY013245.2/61987-62105_Oryza_sativa_Japonica_G.._39947_.1:0.00836,135.10_AJ489952.1/1-119_Oryza_sativa_{rice}_4530_.1:0.00055)1.000:0.05288)0.380:0.00702,(105.20_AC135465.23/41412-41522_Medicago_truncatula_{ba.._3880_.1:0.27549,(_AJ298135.1/1-115_Arabidopsis_thaliana_{t.._3702_.1:0.15425,74.50_AF318011.1/1-106_Arabidopsis_thaliana_{t.._3702_.2:0.32379)0.010:0.07279)0.960:0.16629)0.740:0.00990,73.00_AJ489954.1/1-104_Oryza_sativa_{rice}_4530_.4:0.29332)0.790:0.01696);"
Either form can be parsed successfully. Only the taxon names will differ:
net1 = readTopology(nwk1);
tipLabels(net1)[3] # third taxon name
"136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G.2"
net2 = readTopology(nwk2);
tipLabels(net2)[3]
"136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G.._39947_.2"
Now, what if this newick format is used in many trees within a file?
write("rfam.nex", """
#nexus
begin trees;
tree gt = $rfam_nwk
end;
""")
With v0.16.1, we can read this file successfully if we pass it the desired string modifier:
treelist = readnexus_treeblock("rfam.nex", stringmodifier = [r"\.*\[\d+\]" => ""]); # same as in net1
length(treelist) # 1
tipLabels(treelist[1])[3] # third taxon name in 1st (and only) tree in the list
"136.20_AY013245.2/57789-57907_Oryza_sativa_Japonica_G.2" |
Thanks! |
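For anyone who wants to check what the two regular expressions proposed earlier in this thread do before running them on a full Rfam tree, here is a minimal sketch using only Base Julia (no PhyloNetworks calls). The short string `toy` is made up for illustration, and the helper `strip_rfam_tags` is hypothetical: it is not part of PhyloNetworks, and it assumes one Newick string per line.

```julia
# Made-up label string in the same style as the Rfam tree (illustration only).
toy = "(12.3_X1/1-10_Genus_sp_{a..[1234].1:0.1,45.6_Y2/5-20_Genus_sp_{b}[99].2:0.2);"

# Option 1: drop each bracketed number and any dots right before it.
drop_ids = r"\.*\[\d+\]" => ""
# Option 2: keep the number, turning each bracket into an underscore.
keep_ids = r"[\[\]]" => "_"

clean1 = replace(toy, drop_ids)
clean2 = replace(toy, keep_ids)

# Hypothetical helper (an assumption, not a PhyloNetworks function):
# clean every non-empty line, assuming one Newick string per line.
function strip_rfam_tags(lines; modifier = drop_ids)
    return [replace(String(strip(l)), modifier) for l in lines if !isempty(strip(l))]
end

println(clean1)
println(clean2)
```

With a plain-text file holding one tree per line, something like `strip_rfam_tags(readlines("trees.txt"))` (file name hypothetical) would give strings with no square brackets left, which could then be passed to the parser one at a time.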
It seems the package cannot read extended Newick format trees?
I'm having trouble in particular reading trees stored in RFAM (https://docs.rfam.org/en/latest/api.html?highlight=nhx#tree-data). For example,
http://rfam.org/family/RF00360/tree/