Skip to content

Commit

Permalink
Merge 61cd29f into daeffdc
Browse files Browse the repository at this point in the history
  • Loading branch information
cecileane authored Jan 14, 2019
2 parents daeffdc + 61cd29f commit cdd9fba
Show file tree
Hide file tree
Showing 11 changed files with 141 additions and 70 deletions.
Binary file added docs/src/assets/logo.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
1 change: 1 addition & 0 deletions docs/src/lib/public.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@ readSnaqNetwork
readTrees2CF
readTableCF
readTableCF!
writeTableCF
readBootstrapTrees
writeSubTree!
writeTopology
Expand Down
13 changes: 7 additions & 6 deletions docs/src/man/inputdata.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,13 +75,14 @@ nothing # hide
To read in all gene trees and directly summarize them by a list
of quartet CFs (proportion of input trees with a given quartet):
```@repl qcf
raxmlCF = readTrees2CF(raxmltrees, CFfile="tableCF.txt");
PhyloNetworks.writeObsCF(raxmlCF) # observed CFs: gene frequencies
rm("tableCF.txt") # hide
raxmlCF = readTrees2CF(raxmltrees, CFfile="tableCF.csv");
df = writeTableCF(raxmlCF) # data frame with observed CFs: gene frequencies
CSV.write("tableCF.csv", df) # to save the data frame to a file
rm("tableCF.csv") # hide
rm("summaryTreesQuartets.txt") # hide
```
`less("tableCF.txt")` lets you see the content of the newly created
file "tableCF.txt", within Julia. Again, type `q` to quit viewing this file.
`less("tableCF.csv")` lets you see the content of the newly created
file "tableCF.csv", within Julia. Again, type `q` to quit viewing this file.

In this table, each 4-taxon set is listed in one row.
The 3 "CF" columns gives the proportion of genes that has
Expand Down Expand Up @@ -132,7 +133,7 @@ using CSV, DataFrames
dat = CSV.read(buckyCFfile);
first(dat, 6) # to see the first 6 rows
buckyCF = readTableCF(dat)
PhyloNetworks.writeObsCF(buckyCF)
writeTableCF(buckyCF)
```
In the input file, columns need to be in the right order:
with the first 4 columns giving the names of the taxa in each 4-taxon set.
Expand Down
44 changes: 43 additions & 1 deletion docs/src/man/multiplealleles.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,14 +19,56 @@ species level: with the allele names replaced by the appropriate species names.
The second command modifies this data frame `df_sp` by deleting rows that are uninformative
about between-species relationships, such as rows corresponding to 4 individuals from the
same species. The output `d_sp` of this second command is an object of type `DataCF` at the
species level, which can be used as input for networks estimation with `snaq!`:
species level, which can be used as input for networks estimation with `snaq!`.
But before, it is safe to save the concordance factor of quartets of species,
which can be calculated by averaging the CFs of quartets of individuals
from the associated species:

```julia
df_sp = writeTableCF(d_sp) # data frame, quartet CFs averaged across individuals of same species
CSV.write("CFtable_species.csv", df_sp) # save to file
```

Some quartets have the same species repeated twice,
representing cases when 2 of the 4 individuals came from the same species.
These quartets, with repeated species, are informative about the population
size of extant populations, i.e. about the lengths of external branches in
coalescent units.

now we can run snaq:

```julia
net = snaq!(T_sp, d_sp);
```
where `T_sp` should be a starting topology with one tip per species,
labelled with the same species names as the names used in the mapping file.

If `snaq!` takes too long that way, we can try a less ambitious estimation
that does not estimate the external branch lengths, that is,
*without* using quartets that have 2 individuals from the same species.
To do so, we can use the quartet concordance factors at the species level,
but filter out the quartets with one (or more) species repeated:

```julia
df_sp = writeTableCF(d_sp) # some quartets have the same species twice
function hasrep(row) # see if a row (4-taxon set) has a species name ending with "__2": repeated species
occursin(r"__2$", row[:tx1]) || occursin(r"__2$", row[:tx2]) ||
occursin(r"__2$", row[:tx3]) || occursin(r"__2$", row[:tx4])
end
df_sp_reduced = filter(!hasrep, df_sp) # removes rows with repeated species
df_sp_reduced # should have fewer rows than df_sp
CSV.write("CFtable_species_norep.csv", df_sp_reduced) # to save to file
d_sp_reduced = readTableCF(df_sp_reduced) # DataCF object, for input to snaq!
```

and now we can run `snaq!` on the reduced set of quartets without repeats,
which should be faster:

```julia
net = snaq!(T_sp, d_sp_reduced);
```


Warnings:

- This feature has not been fully tested
Expand Down
2 changes: 2 additions & 0 deletions src/PhyloNetworks.jl
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ module PhyloNetworks
readTrees2CF,
readTableCF,
readTableCF!,
writeTableCF,
readInputTrees,
readNexusTrees,
summarizeDataCF,
Expand Down Expand Up @@ -164,5 +165,6 @@ module PhyloNetworks
include("biconnectedComponents.jl")
include("traitsLikDiscrete.jl")
include("ticr.jl")
include("deprecated.jl")

end #module
1 change: 1 addition & 0 deletions src/deprecated.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
@deprecate writeObsCF writeTableCF
2 changes: 1 addition & 1 deletion src/multipleAlleles.jl
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,7 @@ function mapAllelesCFtable!(cfDF::DataFrame, alleleDF::DataFrame, co::Vector{Int
end

# function to clean a df after changing allele names to species names
# inside mapAllelesCFtable
# inside readTableCF!
# by deleting rows that are not informative like sp1 sp1 sp1 sp2
# keepOne=true: we only keep one allele per species
function cleanAlleleDF!(newdf::DataFrame, cols::Vector{Int};keepOne=false::Bool)
Expand Down
4 changes: 2 additions & 2 deletions src/optimization.jl
Original file line number Diff line number Diff line change
Expand Up @@ -398,8 +398,8 @@ function optBL!(net::HybridNetwork, d::DataCF, verbose::Bool, ftolRel::Float64,
if verbose println("OPTBL: starting point $(ht)") # to stdout
else @debug "OPTBL: starting point $(ht)"; end # to logger if debug turned on by user
fmin, xmin, ret = NLopt.optimize(opt,ht)
if verbose println("got $(round(fmin, digits=5)) at $(round(xmin, digits=5)) after $(count) iterations (returned $(ret))")
else @debug "got $(round(fmin, digits=5)) at $(round(xmin, digits=5)) after $(count) iterations (returned $(ret))"; end
if verbose println("got $(round(fmin, digits=5)) at $(round.(xmin, digits=5)) after $(count) iterations (returned $(ret))")
else @debug "got $(round(fmin, digits=5)) at $(round.(xmin, digits=5)) after $(count) iterations (returned $(ret))"; end
updateParameters!(net,xmin)
net.loglik = fmin
#return fmin,xmin
Expand Down
116 changes: 65 additions & 51 deletions src/readData.jl
Original file line number Diff line number Diff line change
Expand Up @@ -14,20 +14,33 @@ end

writeExpCF(d::DataCF) = writeExpCF(d.quartet)

# function to write a csv table from the obsCF of an
# array of quartets
function writeObsCF(quartets::Array{Quartet,1})
"""
writeTableCF(vector of quartets)
writeTableCF(DataCF)
Build a DataFrame containing observed quartet concordance factors,
with columns named:
- `:tx1`, `:tx2`, `:tx3`, `:tx4` for the four taxon names in each quartet
- `:CF12_34`, `:CF13_24`, `:CF14_23` for the 3 quartets of a given four-taxon set
- `:ngenes` if this information is available for some quartets
"""
function writeTableCF(quartets::Array{Quartet,1})
df = DataFrames.DataFrame(t1=String[],t2=String[],t3=String[],t4=String[],
CF12_34=Float64[],CF13_24=Float64[],CF14_23=Float64[],ngenes=Int[])
CF12_34=Float64[],CF13_24=Float64[],CF14_23=Float64[],
ngenes=Union{Missing, Float64}[])
for q in quartets
length(q.taxon) == 4 || error("quartet $(q.number) does not have 4 taxa")
length(q.obsCF) == 3 || error("quartet $(q.number) does have qnet with 3 expCF")
push!(df, [q.taxon[1],q.taxon[2],q.taxon[3],q.taxon[4],q.obsCF[1],q.obsCF[2],q.obsCF[3],q.ngenes])
push!(df, [q.taxon[1],q.taxon[2],q.taxon[3],q.taxon[4],q.obsCF[1],q.obsCF[2],q.obsCF[3],
(q.ngenes==-1.0 ? missing : q.ngenes)])
end
if all(ismissing, df[:ngenes])
deletecols!(df, :ngenes)
end
return df
end

writeObsCF(d::DataCF) = writeObsCF(d.quartet)
writeTableCF(d::DataCF) = writeTableCF(d.quartet)

"""
readTableCF(file)
Expand All @@ -38,7 +51,9 @@ Read a file or DataFrame object containing a table of concordance factors (CF),
with one row per 4-taxon set. The first 4 columns are assumed to give the labels
of the 4 taxa in each set (tx1, tx2, tx3, tx4).
Columns containing the CFs are assumed to be named
'CF12_34', 'CF13_24' and 'CF14_23', or else are assumed to be columns 5,6,7.
`CF12_34`, `CF13_24` and `CF14_23`;
or `CF12.34`, `CF13.24` and `CF14.23`;
or else are assumed to be columns 5,6,7.
If present, a column named 'ngenes' will be used to get the number of loci
used to estimate the CFs for each 4-taxon set.
Expand Down Expand Up @@ -91,7 +106,7 @@ function readTableCF!(df::DataFrames.DataFrame; summaryfile=""::AbstractString)

d = readTableCF!(df, columns)

if withngenes && d.numTrees == -1
if withngenes # && d.numTrees == -1
m1 = minimum([q.ngenes for q in d.quartet])
m2 = maximum([q.ngenes for q in d.quartet])
if m1<m2 print("between $m1 and ") end
Expand All @@ -118,7 +133,7 @@ function readTableCF!(df::DataFrames.DataFrame, co::Vector{Int})
for i in 1:size(df,1)
push!(quartets,Quartet(i,string(df[i,co[1]]),string(df[i,co[2]]),string(df[i,co[3]]),string(df[i,co[4]]),
[df[i,co[5]],df[i,co[6]],df[i,co[7]]]))
if(withngenes)
if withngenes
quartets[end].ngenes = df[i,co[8]]
end
end
Expand Down Expand Up @@ -479,14 +494,40 @@ function calculateObsCFAll_noDataCF!(quartets::Vector{Quartet}, trees::Vector{Hy
return nothing
end

# function to read input list of gene trees/quartets and calculates obsCF
# as opposed to readTableCF that read the table of obsCF directly
# input: treefile (with gene trees), quartetfile (with list of quartets),
# whichQ (:add/:rand to decide if all or random sample of quartets, default all)
# numQ: number of quartets in random sample
# writetab = true to write the table of obsCF as file with name filename
# does it by default
# writeFile=true writes file with sampled quartets, default false
"""
readInputData(trees, quartetfile, whichQuartets, numQuartets, writetable, tablename, writeQfile, writesummary)
readInputData(trees, whichQuartets, numQuartets, taxonlist, writetable, tablename, writeQfile, writesummary)
Read gene trees and calculate the observed quartet concordance factors (CF),
that is, the proportion of genes (and the number of genes) that display each
quartet for a given list of four-taxon sets.
Input:
- `trees`: name of a file containing a list of input gene trees,
or vector of trees (`HybridNetwork` objects)
Optional arguments (defaults):
- `quartetfile`: name of a file containing a list of quartets, or more precisely,
a list of four-taxon sets
- `whichQuartets` (`:all`): which quartets to sample.
`:all` for all of them, `:rand` for a random sample.
- `numQuartets`: number of quartets in the sample.
default: total number of quartets if `whichQuartets=:all`
and 10% of total if `whichQuartets=:rand`
- `taxonlist` (all in the input gene trees):
If `taxonlist` is used, `whichQuartets` will consist of *all* sets of 4 taxa in the `taxonlist`.
- `writetable` (true): write the table of observed CF?
- `tablename` ("tableCF.txt"): if `writetable` is true, the table of observed CFs is write to file `tablename`
- `writeQfile` (false): write intermediate file with sampled quartets?
- `writesummary` (true): write a summary file?
if so, the summary will go in file "summaryTreesQuartets.txt".
See also:
[`readTrees2CF`](@ref), which is basically a re-naming of `readInputData`, and
[`readTableCF`](@ref) to read a table of quartet CFs directly.
"""
function readInputData(treefile::AbstractString, quartetfile::AbstractString, whichQ::Symbol, numQ::Integer, writetab::Bool, filename::AbstractString, writeFile::Bool, writeSummary::Bool)
if writetab
if(filename == "none")
Expand All @@ -508,14 +549,6 @@ readInputData(treefile::AbstractString, quartetfile::AbstractString, whichQ::Sym
##readInputData(treefile::AbstractString, quartetfile::AbstractString) = readInputData(treefile, quartetfile, :all, 0, true, "none", false, true)
readInputData(treefile::AbstractString, quartetfile::AbstractString, writetab::Bool, filename::AbstractString) = readInputData(treefile, quartetfile, :all, 0, writetab, filename, false, true)

# function to read input list of gene trees/quartets and calculates obsCF
# as opposed to readTableCF that read the table of obsCF directly
# input: trees Vector of HybridNetwork, quartetfile (with list of quartets),
# whichQ (:add/:rand to decide if all or random sample of quartets, default all)
# numQ: number of quartets in random sample
# writetab = true to write the table of obsCF as file with name filename
# does it by default
# writeFile=true writes file with sampled quartets, default false
function readInputData(trees::Vector{HybridNetwork}, quartetfile::AbstractString, whichQ::Symbol, numQ::Integer, writetab::Bool, filename::AbstractString, writeFile::Bool, writeSummary::Bool)
if(whichQ == :all)
numQ == 0 || @warn "set numQ=$(numQ) but whichQ is not rand, so all quartets will be used and numQ will be ignored. If you want a specific number of 4-taxon subsets not random, you can input with the quartetfile option"
Expand Down Expand Up @@ -543,7 +576,7 @@ function readInputData(trees::Vector{HybridNetwork}, quartetfile::AbstractString
with readTableCF(\"$(filename)\")""")
end
println("\ntable of obsCF printed to file $(filename)")
df = writeObsCF(d)
df = writeTableCF(d)
CSV.write(filename,df)
end
#descData(d,"summaryTreesQuartets$(string(integer(time()/1000))).txt")
Expand All @@ -552,16 +585,6 @@ function readInputData(trees::Vector{HybridNetwork}, quartetfile::AbstractString
end



# function to read input list of gene trees, and not the list of quartets
# so it creates the list of quartets inside and calculates obsCF
# as opposed to readTableCF that read the table of obsCF directly
# input: treefile (with gene trees), whichQ (:add/:rand to decide if all or random sample of quartets, default all)
# numQ: number of quartets in random sample
# taxa: list of taxa, if not given, all taxa in gene trees used
# writetab = true to write the table of obsCF as file with name filename
# does it by default
# writeFile= true, writes intermediate files with the quartets info (default false)
function readInputData(treefile::AbstractString, whichQ::Symbol, numQ::Integer, taxa::Union{Vector{String}, Vector{Int}}, writetab::Bool, filename::AbstractString, writeFile::Bool, writeSummary::Bool)
if writetab
if(filename == "none")
Expand Down Expand Up @@ -589,15 +612,6 @@ readInputData(treefile::AbstractString,taxa::Union{Vector{String}, Vector{Int}})
# is not good: need to read the tree file twice: get the taxa, then get the trees
# this inefficiency was fixed in readTrees2CF

# function to read input vector of HybridNetworks, and not the list of quartets
# so it creates the list of quartets inside and calculates obsCF
# as opposed to readTableCF that read the table of obsCF directly
# input: trees (Vector of HybridNetwork), whichQ (:add/:rand to decide if all or random sample of quartets, default all)
# numQ: number of quartets in random sample
# taxa: list of taxa, if not given, all taxa in gene trees used
# writetab = true to write the table of obsCF as file with name filename
# does it by default
# writeFile= true, writes intermediate files with the quartets info (default false)
function readInputData(trees::Vector{HybridNetwork}, whichQ::Symbol, numQ::Integer, taxa::Union{Vector{String}, Vector{Int}}, writetab::Bool, filename::AbstractString, writeFile::Bool, writeSummary::Bool)
if(whichQ == :all)
numQ == 0 || @warn "set numQ=$(numQ) but whichQ=all, so all quartets will be used and numQ will be ignored. If you want a specific number of 4-taxon subsets not random, you can input with the quartetfile option"
Expand All @@ -619,7 +633,7 @@ function readInputData(trees::Vector{HybridNetwork}, whichQ::Symbol, numQ::Integ
filename = "tableCF.txt"
end
println("table of obsCF printed to file $(filename)")
df = writeObsCF(d)
df = writeTableCF(d)
CSV.write(filename,df)
end
#descData(d,"summaryTreesQuartets$(string(integer(time()/1000))).txt")
Expand Down Expand Up @@ -719,20 +733,20 @@ end
# pc: only 4-taxon subsets with percentage of gene trees less than pc will be printed (default 70%)
function descData(d::DataCF, sout::IO, pc::Float64)
0<=pc<=1 || error("percentage of missing genes should be between 0,1, not: $(pc)")
if(!isempty(d.tree))
if !isempty(d.tree)
print(sout,"data consists of $(d.numTrees) gene trees and $(d.numQuartets) 4-taxon subsets\n")
taxaTreesQuartets(d.tree,d.quartet,sout)
print(sout,"----------------------------\n\n")
print(sout,"will print below only the 4-taxon subsets with data from <= $(round((pc)*100, digits=2))% genes\n")
for q in d.quartet
percent = q.ngenes == -1 ? 0.0 : round(q.ngenes/d.numTrees*100, digits=2)
if(percent < pc)
print(sout,"4-taxon subset $(q.taxon) obsCF constructed with $(q.ngenes) gene trees ($(percent)%)\n")
percent = q.ngenes == -1.0 ? 0.0 : round(q.ngenes/d.numTrees*100, digits=2)
if percent < pc
print(sout,"4-taxon subset $(q.taxon) obsCF constructed with $(round(q.ngenes)) gene trees ($(percent)%)\n")
end
end
print(sout,"----------------------------\n\n")
else
if(!isempty(d.quartet))
if !isempty(d.quartet)
print(sout,"data consists of $(d.numQuartets) 4-taxon subsets")
taxa=unionTaxa(d.quartet)
print(sout,"\nTaxa: $(taxa)\n")
Expand Down
12 changes: 6 additions & 6 deletions src/types.jl
Original file line number Diff line number Diff line change
Expand Up @@ -288,7 +288,7 @@ type that saves the information on a given 4-taxon subset. It contains the follo
- taxon: vector of taxon names, like t1 t2 t3 t4
- obsCF: vector of observed CF, in order 12|34, 13|24, 14|23
- logPseudoLik
- ngenes: number of gene trees used to compute the observed CF; -1 if unknown
- ngenes: number of gene trees used to compute the observed CF; -1.0 if unknown
- qnet: [`QuartetNetwork`](@ref), which saves the expCF after snaq estimation to
emphasize that the expCF depend on a specific network, not the data
"""
Expand All @@ -297,13 +297,13 @@ mutable struct Quartet
taxon::Array{String,1} # taxa 1234. qnet.quartetTaxon points to the same array.
obsCF::Array{Float64,1} # three observed CF in order 12|34, 13|24, 14|23
qnet::QuartetNetwork # quartet network for the current network (want to keep as if private attribute)
logPseudoLik::Float64 # log pseudolik value for the quartet
ngenes::Number # number of gene trees used to compute the obsCV, default -1; changed to Number in case ngenes is average, so float
logPseudoLik::Float64 # log pseudolik value for the quartet. 0.0 by default
ngenes::Float64 # number of gene trees used to compute the obsCV, default -1.; Float in case ngenes is average
# inner constructor: to guarantee obsCF are only three and add up to 1
function Quartet(number::Integer,t1::AbstractString,t2::AbstractString,t3::AbstractString,t4::AbstractString,obsCF::Array{Float64,1})
size(obsCF,1) != 3 ? error("observed CF vector should have size 3, not $(size(obsCF,1))") : nothing
0.99 < sum(obsCF) < 1.02 || @warn "observed CF should add up to 1, not $(sum(obsCF))"
new(number,[t1,t2,t3,t4],obsCF,QuartetNetwork(),0,-1);
new(number,[t1,t2,t3,t4],obsCF,QuartetNetwork(),0.0,-1.0);
end
function Quartet(number::Integer,t1::Array{String,1},obsCF::Array{Float64,1})
size(obsCF,1) != 3 ? error("observed CF vector should have size 3, not $(size(obsCF,1))") : nothing
Expand All @@ -312,9 +312,9 @@ mutable struct Quartet
0.0 <= obsCF[1] <= 1.0 || error("obsCF must be between (0,1), but it is $(obsCF[1]) for $(t1)")
0.0 <= obsCF[2] <= 1.0 || error("obsCF must be between (0,1), but it is $(obsCF[2]) for $(t1)")
0.0 <= obsCF[3] <= 1.0 || error("obsCF must be between (0,1), but it is $(obsCF[3]) for $(t1)")
new(number,t1,obsCF,QuartetNetwork(),0,-1);
new(number,t1,obsCF,QuartetNetwork(),0.0,-1.0);
end
Quartet() = new(0,[],[],QuartetNetwork(),0,-1)
Quartet() = new(0,[],[],QuartetNetwork(),0.0,-1.0)
end

# Data -------
Expand Down
Loading

0 comments on commit cdd9fba

Please sign in to comment.