# Markov Chain Monte Carlo for Trees using a Mamba extension

Typical data format in phylogenetic linguistics:

| Language  | Mountain | You | ... |
|-----------|----------|-----|-----|
| Swedish   | 1        | 1   | ... |
| Norwegian | 1        | 1   | ... |
| Italian   | 2        | 1   | ... |

Such data is used for Bayesian phylogenetic inference. Given a Markov process that describes the underlying character evolution, Markov chain Monte Carlo (MCMC) methods can be used to estimate a posterior distribution over phylogenetic trees. These (or other) trees can then be used to address further statistical questions. Many of these questions rest on statistical models whose parameters must themselves be inferred with MCMC methods. This requires a flexible framework in which such models can be defined. In addition, inference should be reasonably fast.
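To make the idea of MCMC-based posterior estimation concrete, here is a minimal random-walk Metropolis sketch in plain Julia. It is not the sampler used by Mamba or MCPhylo. The toy target (a `Uniform(0, 1)` prior on a probability `p` with 7 successes in 10 trials) and the names `logpost` and `metropolis` are assumptions chosen purely for illustration.

```julia
using Random

# Toy log-posterior (hypothetical): Uniform(0, 1) prior on p,
# binomial likelihood with 7 successes in 10 trials.
logpost(p) = (0.0 < p < 1.0) ? 7 * log(p) + 3 * log(1 - p) : -Inf

# Random-walk Metropolis: propose a nearby value and accept it with
# probability min(1, posterior ratio), evaluated in log space.
function metropolis(logpost, init, nsteps; step = 0.1, rng = MersenneTwister(1234))
    samples = Vector{Float64}(undef, nsteps)
    current, current_lp = init, logpost(init)
    for i in 1:nsteps
        proposal = current + step * randn(rng)
        proposal_lp = logpost(proposal)
        if log(rand(rng)) < proposal_lp - current_lp
            current, current_lp = proposal, proposal_lp
        end
        samples[i] = current   # a rejected proposal repeats the current value
    end
    return samples
end

samples = metropolis(logpost, 0.5, 20_000)
posterior_mean = sum(samples) / length(samples)
println(posterior_mean)
```

For this toy target the exact posterior is Beta(8, 4) with mean 2/3, so the sample mean should land close to that value. Sampling trees works on the same accept/reject principle but needs proposals that change the topology and branch lengths.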

The Julia programming language is a relatively new language that aims to combine high performance with ease of writing. This makes it a good basis for a system in which many users can define models for their own needs and which also evaluates these models efficiently.

The [Mamba package](https://mambajl.readthedocs.io/en/latest/intro.html) offers a good starting point for building such an infrastructure in Julia.

In [1]:
include("../MCPhylo/src/MCPhylo.jl");
using .MCPhylo;
using Random;
using GraphIO
Random.seed!(1234);


As a first step, read the data from the data file. The function `make_tree_with_data` returns a tree object (`m_tree`), a random binary tree whose leaves are the languages specified in the data file, and an array object (`df`) that stores the character information.

In [2]:
m_tree, df = make_tree_with_data("../local/development.nex"); # load your own nexus file

Next, all relevant information is stored in a dictionary so that it can be used later on. The entries `:nnodes`, `:nbase` and `:nsites` store the dimensions of the data array. In addition, the data array is log-transformed, which is necessary for the likelihood computation.

In [3]:
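my_data = Dict{Symbol, Any}(
  :mtree => m_tree,          # random starting tree
  :df => log.(df),           # element-wise log of the data array
  :nnodes => size(df, 1),    # first dimension of the data array
  :nbase => size(df, 2),     # second dimension
  :nsites => size(df, 3))    # third dimension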


Dict{Symbol,Any} with 5 entries:
  :df     => [0.0 -Inf; 0.0 -Inf; … ; -Inf -Inf; -Inf -Inf]…
  :nnodes => 17
  :mtree  => "17"
  :nsites => 3132
  :nbase  => 2
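The effect of the log transformation above can be illustrated on a small stand-in array. The toy array `df_toy` and its one-hot coding are assumptions made for this sketch. The real array comes from `make_tree_with_data`. The second half shows a common reason for working in log space: a product of many small per-site likelihoods underflows, while the sum of their logarithms stays finite.

```julia
# Hypothetical toy data: 2 leaves x 2 states x 3 sites, one-hot coded.
# df_toy[leaf, state, site] == 1.0 means that state is observed.
df_toy = zeros(2, 2, 3)
df_toy[1, 1, :] .= 1.0          # leaf 1 shows state 1 at every site
df_toy[2, 2, :] .= 1.0          # leaf 2 shows state 2 at every site

log_df = log.(df_toy)           # 1.0 becomes 0.0, 0.0 becomes -Inf
println(log_df[1, :, 1])        # prints [0.0, -Inf]

# Multiplying 100 per-site likelihoods of 1e-5 underflows to zero,
# whereas summing their logs gives a usable number.
site_likelihoods = fill(1e-5, 100)
direct_product = prod(site_likelihoods)
log_sum = sum(log.(site_likelihoods))
println(direct_product)         # prints 0.0
println(log_sum)                # about -1151.29
```

The `-Inf` entries match the pattern visible in the `:df` output above: an unobserved state has probability zero, which corresponds to `-Inf` on the log scale.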

The central part is the model definition. A model is defined as a graph whose nodes are the parameters and the data and whose edges make the dependencies between them explicit.

In [8]:
# model setup
model = Model(
    # data node: 3-dimensional array with a phylogenetic likelihood, not monitored
    df = Stochastic(3,
        (mtree, mypi, rates, nnodes, nbase, nsites) -> PhyloDist(mtree, mypi, rates, nnodes, nbase, nsites), false
    ),
    mypi = Stochastic(() -> Uniform(0.0, 1.0)),
    # tree node with a compound Dirichlet prior, monitored
    mtree = Stochastic(Node(), () -> CompoundDirichlet(1.0, 1.0, 0.100, 1.0), true),
    # deterministic node: looks up the entry of `av` selected by each entry of `mymap`
    rates = Logical(1, (mymap, av) -> [av[convert(UInt8, i)] for i in mymap], false),
    mymap = Stochastic(1, () -> Categorical([0.25, 0.25, 0.25, 0.25]), false),
    av = Stochastic(1, () -> Dirichlet([1.0, 1.0, 1.0, 1.0]))
)


Object of type "Model"
-------------------------------------------------------------------------------
mymap:
Object of type "0-element ArrayStochastic{1}"
Float64[]
-------------------------------------------------------------------------------
av:
Object of type "0-element ArrayStochastic{1}"
Float64[]
-------------------------------------------------------------------------------
df:
Object of type "0×0×0 ArrayStochastic{3}"
Array{Float64}(undef,0,0,0)
-------------------------------------------------------------------------------
rates:
Object of type "0-element ArrayLogical{1}"
Float64[]
-------------------------------------------------------------------------------
mtree:
Object of type "Main.MCPhylo.TreeStochastic"
"noname"
-------------------------------------------------------------------------------
mypi:
Object of type "ScalarStochastic"
NaN
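The `rates` node in the model above is a deterministic lookup: every entry of `mymap` selects one entry of `av`. The following sketch reproduces that comprehension on made-up values. The vectors `av_toy` and `mymap_toy` are hypothetical and only stand in for the values the sampler would hold. Because the node values are stored as `Float64` (see the `Float64[]` entries in the output above), each label has to be converted to an integer before it can serve as an index.

```julia
# Hypothetical values standing in for the model nodes.
av_toy = [0.1, 0.2, 0.3, 0.4]             # four rate values, as a Dirichlet draw would sum to 1
mymap_toy = [1.0, 3.0, 3.0, 4.0, 2.0]     # category labels stored as Float64

# Same comprehension as in the `rates` node: convert each label to an
# integer and use it to index into the rate vector.
rates_toy = [av_toy[convert(UInt8, i)] for i in mymap_toy]
println(rates_toy)                         # prints [0.1, 0.3, 0.3, 0.4, 0.2]
```

Entries that share a label share a rate, so the number of distinct rates is bounded by the length of `av` no matter how long `mymap` is.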


In [12]:
draw(model)

digraph MambaModel {
	"mymap" [shape="ellipse", style="filled", fillcolor="gray85"];
		"mymap" -> "rates";
	"av" [shape="ellipse"];
		"av" -> "rates";
	"df" [shape="ellipse", style="filled", fillcolor="gray85"];
	"mtree" [shape="ellipse"];
		"mtree" -> "df";
	"mypi" [shape="ellipse"];
		"mypi" -> "df";
	"rates" [shape="diamond", style="filled", fillcolor="gray85"];
		"rates" -> "df";
	"nnodes" [shape="box", style="filled", fillcolor="gray85"];
		"nnodes" -> "df";
	"nbase" [shape="box", style="filled", fillcolor="gray85"];
		"nbase" -> "df";
	"nsites" [shape="box", style="filled", fillcolor="gray85"];
		"nsites" -> "df";
}
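The DOT output above lists the edges of the model graph. As a rough sketch of how such a graph fixes the order in which nodes can be evaluated, the snippet below rebuilds the edge list in plain Julia and sorts the nodes so that every node comes after its parents. The helper `topological_order` is a hypothetical illustration and not part of Mamba or MCPhylo.

```julia
# Edges copied from the DOT output above, written as parent => child.
edges = ["mymap" => "rates", "av" => "rates", "mtree" => "df",
         "mypi" => "df", "rates" => "df", "nnodes" => "df",
         "nbase" => "df", "nsites" => "df"]

# Collect the parents of every node; source nodes get an empty list.
parents = Dict{String, Vector{String}}()
for (src, dst) in edges
    push!(get!(parents, dst, String[]), src)
    get!(parents, src, String[])
end

# Kahn-style topological sort: repeatedly emit all nodes whose parents
# have already been emitted.
function topological_order(parents)
    order = String[]
    remaining = Set(keys(parents))
    while !isempty(remaining)
        ready = sort([n for n in remaining if all(p -> p in order, parents[n])])
        isempty(ready) && error("graph contains a cycle")
        append!(order, ready)
        setdiff!(remaining, ready)
    end
    return order
end

order = topological_order(parents)
println(parents["df"])
println(order)
```

In the resulting order `rates` appears after `mymap` and `av`, and `df` comes last, which mirrors the dependencies declared through the function arguments in the model definition.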


[Model graph (PDF)](my_graph.dot.pdf)