Skip to content

Commit

Permalink
bug fix in snaq with verbose=true
Browse files Browse the repository at this point in the history
writeObsCF deprecate, use writeTableCF instead
more doc for multiple alleles
ngenes::Float64, not Number in Quartet type
  • Loading branch information
cecileane committed Jan 14, 2019
1 parent eb17c70 commit 61cd29f
Show file tree
Hide file tree
Showing 10 changed files with 104 additions and 34 deletions.
1 change: 1 addition & 0 deletions docs/src/lib/public.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@ readSnaqNetwork
readTrees2CF
readTableCF
readTableCF!
writeTableCF
readBootstrapTrees
writeSubTree!
writeTopology
Expand Down
13 changes: 7 additions & 6 deletions docs/src/man/inputdata.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,13 +75,14 @@ nothing # hide
To read in all gene trees and directly summarize them by a list
of quartet CFs (proportion of input trees with a given quartet):
```@repl qcf
raxmlCF = readTrees2CF(raxmltrees, CFfile="tableCF.txt");
PhyloNetworks.writeObsCF(raxmlCF) # observed CFs: gene frequencies
rm("tableCF.txt") # hide
raxmlCF = readTrees2CF(raxmltrees, CFfile="tableCF.csv");
df = writeTableCF(raxmlCF) # data frame with observed CFs: gene frequencies
CSV.write("tableCF.csv", df) # to save the data frame to a file
rm("tableCF.csv") # hide
rm("summaryTreesQuartets.txt") # hide
```
`less("tableCF.txt")` lets you see the content of the newly created
file "tableCF.txt", within Julia. Again, type `q` to quit viewing this file.
`less("tableCF.csv")` lets you see the content of the newly created
file "tableCF.csv", within Julia. Again, type `q` to quit viewing this file.

In this table, each 4-taxon set is listed in one row.
The 3 "CF" columns gives the proportion of genes that has
Expand Down Expand Up @@ -132,7 +133,7 @@ using CSV, DataFrames
dat = CSV.read(buckyCFfile);
first(dat, 6) # to see the first 6 rows
buckyCF = readTableCF(dat)
PhyloNetworks.writeObsCF(buckyCF)
writeTableCF(buckyCF)
```
In the input file, columns need to be in the right order:
with the first 4 columns giving the names of the taxa in each 4-taxon set.
Expand Down
44 changes: 43 additions & 1 deletion docs/src/man/multiplealleles.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,14 +19,56 @@ species level: with the allele names replaced by the appropriate species names.
The second command modifies this data frame `df_sp` by deleting rows that are uninformative
about between-species relationships, such as rows corresponding to 4 individuals from the
same species. The output `d_sp` of this second command is an object of type `DataCF` at the
species level, which can be used as input for networks estimation with `snaq!`:
species level, which can be used as input for networks estimation with `snaq!`.
But before, it is safe to save the concordance factor of quartets of species,
which can be calculated by averaging the CFs of quartets of individuals
from the associated species:

```julia
df_sp = writeTableCF(d_sp) # data frame, quartet CFs averaged across individuals of same species
CSV.write("CFtable_species.csv", df_sp) # save to file
```

Some quartets have the same species repeated twice,
representing cases when 2 of the 4 individuals came from the same species.
These quartets, with repeated species, are informative about the population
size of extant populations, i.e. about the lengths of external branches in
coalescent units.

now we can run snaq:

```julia
net = snaq!(T_sp, d_sp);
```
where `T_sp` should be a starting topology with one tip per species,
labelled with the same species names as the names used in the mapping file.

If `snaq!` takes too long that way, we can try a less ambitious estimation
that does not estimate the external branch lengths, that is,
*without* using quartets that have 2 individuals from the same species.
To do so, we can use the quartet concordance factors at the species level,
but filter out the quartets with one (or more) species repeated:

```julia
df_sp = writeTableCF(d_sp) # some quartets have the same species twice
function hasrep(row) # see if a row (4-taxon set) has a species name ending with "__2": repeated species
occursin(r"__2$", row[:tx1]) || occursin(r"__2$", row[:tx2]) ||
occursin(r"__2$", row[:tx3]) || occursin(r"__2$", row[:tx4])
end
df_sp_reduced = filter(!hasrep, df_sp) # removes rows with repeated species
df_sp_reduced # should have fewer rows than df_sp
CSV.write("CFtable_species_norep.csv", df_sp_reduced) # to save to file
d_sp_reduced = readTableCF(df_sp_reduced) # DataCF object, for input to snaq!
```

and now we can run `snaq!` on the reduced set of quartets without repeats,
which should be faster:

```julia
net = snaq!(T_sp, d_sp_reduced);
```


Warnings:

- This feature has not been fully tested
Expand Down
2 changes: 2 additions & 0 deletions src/PhyloNetworks.jl
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ module PhyloNetworks
readTrees2CF,
readTableCF,
readTableCF!,
writeTableCF,
readInputTrees,
readNexusTrees,
summarizeDataCF,
Expand Down Expand Up @@ -164,5 +165,6 @@ module PhyloNetworks
include("biconnectedComponents.jl")
include("traitsLikDiscrete.jl")
include("ticr.jl")
include("deprecated.jl")

end #module
1 change: 1 addition & 0 deletions src/deprecated.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
@deprecate writeObsCF writeTableCF
2 changes: 1 addition & 1 deletion src/multipleAlleles.jl
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,7 @@ function mapAllelesCFtable!(cfDF::DataFrame, alleleDF::DataFrame, co::Vector{Int
end

# function to clean a df after changing allele names to species names
# inside mapAllelesCFtable
# inside readTableCF!
# by deleting rows that are not informative like sp1 sp1 sp1 sp2
# keepOne=true: we only keep one allele per species
function cleanAlleleDF!(newdf::DataFrame, cols::Vector{Int};keepOne=false::Bool)
Expand Down
4 changes: 2 additions & 2 deletions src/optimization.jl
Original file line number Diff line number Diff line change
Expand Up @@ -398,8 +398,8 @@ function optBL!(net::HybridNetwork, d::DataCF, verbose::Bool, ftolRel::Float64,
if verbose println("OPTBL: starting point $(ht)") # to stdout
else @debug "OPTBL: starting point $(ht)"; end # to logger if debug turned on by user
fmin, xmin, ret = NLopt.optimize(opt,ht)
if verbose println("got $(round(fmin, digits=5)) at $(round(xmin, digits=5)) after $(count) iterations (returned $(ret))")
else @debug "got $(round(fmin, digits=5)) at $(round(xmin, digits=5)) after $(count) iterations (returned $(ret))"; end
if verbose println("got $(round(fmin, digits=5)) at $(round.(xmin, digits=5)) after $(count) iterations (returned $(ret))")
else @debug "got $(round(fmin, digits=5)) at $(round.(xmin, digits=5)) after $(count) iterations (returned $(ret))"; end
updateParameters!(net,xmin)
net.loglik = fmin
#return fmin,xmin
Expand Down
43 changes: 28 additions & 15 deletions src/readData.jl
Original file line number Diff line number Diff line change
Expand Up @@ -14,20 +14,33 @@ end

writeExpCF(d::DataCF) = writeExpCF(d.quartet)

# function to write a csv table from the obsCF of an
# array of quartets
function writeObsCF(quartets::Array{Quartet,1})
"""
writeTableCF(vector of quartets)
writeTableCF(DataCF)
Build a DataFrame containing observed quartet concordance factors,
with columns named:
- `:tx1`, `:tx2`, `:tx3`, `:tx4` for the four taxon names in each quartet
- `:CF12_34`, `:CF13_24`, `:CF14_23` for the 3 quartets of a given four-taxon set
- `:ngenes` if this information is available for some quartets
"""
function writeTableCF(quartets::Array{Quartet,1})
df = DataFrames.DataFrame(t1=String[],t2=String[],t3=String[],t4=String[],
CF12_34=Float64[],CF13_24=Float64[],CF14_23=Float64[],ngenes=Int[])
CF12_34=Float64[],CF13_24=Float64[],CF14_23=Float64[],
ngenes=Union{Missing, Float64}[])
for q in quartets
length(q.taxon) == 4 || error("quartet $(q.number) does not have 4 taxa")
length(q.obsCF) == 3 || error("quartet $(q.number) does have qnet with 3 expCF")
push!(df, [q.taxon[1],q.taxon[2],q.taxon[3],q.taxon[4],q.obsCF[1],q.obsCF[2],q.obsCF[3],q.ngenes])
push!(df, [q.taxon[1],q.taxon[2],q.taxon[3],q.taxon[4],q.obsCF[1],q.obsCF[2],q.obsCF[3],
(q.ngenes==-1.0 ? missing : q.ngenes)])
end
if all(ismissing, df[:ngenes])
deletecols!(df, :ngenes)
end
return df
end

writeObsCF(d::DataCF) = writeObsCF(d.quartet)
writeTableCF(d::DataCF) = writeTableCF(d.quartet)

"""
readTableCF(file)
Expand Down Expand Up @@ -93,7 +106,7 @@ function readTableCF!(df::DataFrames.DataFrame; summaryfile=""::AbstractString)

d = readTableCF!(df, columns)

if withngenes && d.numTrees == -1
if withngenes # && d.numTrees == -1
m1 = minimum([q.ngenes for q in d.quartet])
m2 = maximum([q.ngenes for q in d.quartet])
if m1<m2 print("between $m1 and ") end
Expand All @@ -120,7 +133,7 @@ function readTableCF!(df::DataFrames.DataFrame, co::Vector{Int})
for i in 1:size(df,1)
push!(quartets,Quartet(i,string(df[i,co[1]]),string(df[i,co[2]]),string(df[i,co[3]]),string(df[i,co[4]]),
[df[i,co[5]],df[i,co[6]],df[i,co[7]]]))
if(withngenes)
if withngenes
quartets[end].ngenes = df[i,co[8]]
end
end
Expand Down Expand Up @@ -563,7 +576,7 @@ function readInputData(trees::Vector{HybridNetwork}, quartetfile::AbstractString
with readTableCF(\"$(filename)\")""")
end
println("\ntable of obsCF printed to file $(filename)")
df = writeObsCF(d)
df = writeTableCF(d)
CSV.write(filename,df)
end
#descData(d,"summaryTreesQuartets$(string(integer(time()/1000))).txt")
Expand Down Expand Up @@ -620,7 +633,7 @@ function readInputData(trees::Vector{HybridNetwork}, whichQ::Symbol, numQ::Integ
filename = "tableCF.txt"
end
println("table of obsCF printed to file $(filename)")
df = writeObsCF(d)
df = writeTableCF(d)
CSV.write(filename,df)
end
#descData(d,"summaryTreesQuartets$(string(integer(time()/1000))).txt")
Expand Down Expand Up @@ -720,20 +733,20 @@ end
# pc: only 4-taxon subsets with percentage of gene trees less than pc will be printed (default 70%)
function descData(d::DataCF, sout::IO, pc::Float64)
0<=pc<=1 || error("percentage of missing genes should be between 0,1, not: $(pc)")
if(!isempty(d.tree))
if !isempty(d.tree)
print(sout,"data consists of $(d.numTrees) gene trees and $(d.numQuartets) 4-taxon subsets\n")
taxaTreesQuartets(d.tree,d.quartet,sout)
print(sout,"----------------------------\n\n")
print(sout,"will print below only the 4-taxon subsets with data from <= $(round((pc)*100, digits=2))% genes\n")
for q in d.quartet
percent = q.ngenes == -1 ? 0.0 : round(q.ngenes/d.numTrees*100, digits=2)
if(percent < pc)
print(sout,"4-taxon subset $(q.taxon) obsCF constructed with $(q.ngenes) gene trees ($(percent)%)\n")
percent = q.ngenes == -1.0 ? 0.0 : round(q.ngenes/d.numTrees*100, digits=2)
if percent < pc
print(sout,"4-taxon subset $(q.taxon) obsCF constructed with $(round(q.ngenes)) gene trees ($(percent)%)\n")
end
end
print(sout,"----------------------------\n\n")
else
if(!isempty(d.quartet))
if !isempty(d.quartet)
print(sout,"data consists of $(d.numQuartets) 4-taxon subsets")
taxa=unionTaxa(d.quartet)
print(sout,"\nTaxa: $(taxa)\n")
Expand Down
12 changes: 6 additions & 6 deletions src/types.jl
Original file line number Diff line number Diff line change
Expand Up @@ -288,7 +288,7 @@ type that saves the information on a given 4-taxon subset. It contains the follo
- taxon: vector of taxon names, like t1 t2 t3 t4
- obsCF: vector of observed CF, in order 12|34, 13|24, 14|23
- logPseudoLik
- ngenes: number of gene trees used to compute the observed CF; -1 if unknown
- ngenes: number of gene trees used to compute the observed CF; -1.0 if unknown
- qnet: [`QuartetNetwork`](@ref), which saves the expCF after snaq estimation to
emphasize that the expCF depend on a specific network, not the data
"""
Expand All @@ -297,13 +297,13 @@ mutable struct Quartet
taxon::Array{String,1} # taxa 1234. qnet.quartetTaxon points to the same array.
obsCF::Array{Float64,1} # three observed CF in order 12|34, 13|24, 14|23
qnet::QuartetNetwork # quartet network for the current network (want to keep as if private attribute)
logPseudoLik::Float64 # log pseudolik value for the quartet
ngenes::Number # number of gene trees used to compute the obsCV, default -1; changed to Number in case ngenes is average, so float
logPseudoLik::Float64 # log pseudolik value for the quartet. 0.0 by default
ngenes::Float64 # number of gene trees used to compute the obsCV, default -1.; Float in case ngenes is average
# inner constructor: to guarantee obsCF are only three and add up to 1
function Quartet(number::Integer,t1::AbstractString,t2::AbstractString,t3::AbstractString,t4::AbstractString,obsCF::Array{Float64,1})
size(obsCF,1) != 3 ? error("observed CF vector should have size 3, not $(size(obsCF,1))") : nothing
0.99 < sum(obsCF) < 1.02 || @warn "observed CF should add up to 1, not $(sum(obsCF))"
new(number,[t1,t2,t3,t4],obsCF,QuartetNetwork(),0,-1);
new(number,[t1,t2,t3,t4],obsCF,QuartetNetwork(),0.0,-1.0);
end
function Quartet(number::Integer,t1::Array{String,1},obsCF::Array{Float64,1})
size(obsCF,1) != 3 ? error("observed CF vector should have size 3, not $(size(obsCF,1))") : nothing
Expand All @@ -312,9 +312,9 @@ mutable struct Quartet
0.0 <= obsCF[1] <= 1.0 || error("obsCF must be between (0,1), but it is $(obsCF[1]) for $(t1)")
0.0 <= obsCF[2] <= 1.0 || error("obsCF must be between (0,1), but it is $(obsCF[2]) for $(t1)")
0.0 <= obsCF[3] <= 1.0 || error("obsCF must be between (0,1), but it is $(obsCF[3]) for $(t1)")
new(number,t1,obsCF,QuartetNetwork(),0,-1);
new(number,t1,obsCF,QuartetNetwork(),0.0,-1.0);
end
Quartet() = new(0,[],[],QuartetNetwork(),0,-1)
Quartet() = new(0,[],[],QuartetNetwork(),0.0,-1.0)
end

# Data -------
Expand Down
16 changes: 13 additions & 3 deletions test/test_correctLik.jl
Original file line number Diff line number Diff line change
Expand Up @@ -16,10 +16,19 @@ df=DataFrame(t1=["6","6","10","6","6"],
obsCF14=[0.2729102510259939, 0.30161247267865315, 0.3967750546426937, 0.24693940689390592, 0.2729102510259939])
d = readTableCF(df)
@test_throws ErrorException PhyloNetworks.writeExpCF(d)
@test PhyloNetworks.writeObsCF(d)[1:7] == rename(df, [:obsCF12 => :CF12_34, :obsCF13 => :CF13_24, :obsCF14 => :CF14_23])
@test writeTableCF(d) == rename(df, [:obsCF12 => :CF12_34, :obsCF13 => :CF13_24, :obsCF14 => :CF14_23])
@test tipLabels(d) == ["4","6","7","8","10"]
@test_logs PhyloNetworks.descData(d, devnull)

df[:ngenes] = [10,10,10,10,20]
allowmissing!(df, :ngenes)
d = readTableCF(df)
df[:ngenes][1] = missing; d.quartet[1].ngenes = -1.0
newdf = writeTableCF(d)
@test newdf[1:7] == rename(df, [:obsCF12 => :CF12_34, :obsCF13 => :CF13_24, :obsCF14 => :CF14_23])[1:7]
@test ismissing(newdf[:ngenes][1])
@test newdf[:ngenes][2:end] == df[:ngenes][2:end]

# starting tree:
tree = "((6,4),(7,8),10);"
currT = readTopologyLevel1(tree);
Expand Down Expand Up @@ -71,8 +80,9 @@ end
global net = readTopology("((((6:0.1,4:1.5)1:0.2,((7,60))11#H1)5:0.1,(11#H1,8)),10:0.1);")
@test_logs (:warn, r"^these taxa will be deleted") snaq!(net, d, # taxon "60" in net: not in quartets
hmax=1, runs=1, Nfail=1, seed=1234, ftolRel=1e-2,ftolAbs=1e-2,xtolAbs=1e-2,xtolRel=1e-2)
global n1 = snaq!(currT, d, hmax=1, runs=2, Nfail=1, seed=1234,
ftolRel=1e-2,ftolAbs=1e-2,xtolAbs=1e-2,xtolRel=1e-2)
global n1 = snaq!(currT, d, hmax=1, runs=1, Nfail=1, seed=1234,
ftolRel=1e-2,ftolAbs=1e-2,xtolAbs=1e-2,xtolRel=1e-2,
verbose=true)
addprocs(1)
@everywhere using PhyloNetworks
global n2 = snaq!(currT, d, hmax=1, runs=2, Nfail=1, seed=1234,
Expand Down

0 comments on commit 61cd29f

Please sign in to comment.