In [1]:
library(gaia)
library(tidyr)
library(dplyr)
library(data.table)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘data.table’


The following objects are masked from ‘package:dplyr’:

    between, first, last




In [2]:
#?treeseq_discrete_mpr

## read tree sequence

In [3]:
filename = "demo.trees"

ts = treeseq_load(filename)

nodes = treeseq_nodes(ts)

edges = treeseq_edges(ts)

# view first local tree 
tree = treeseq_to_phylo(ts)

## extract sample nodes, locations

In [4]:
# identify sample nodes
samples <- subset(nodes, is_sample == 1L)

# use population_id as geography; drop missing-like values (e.g., -1)
ok       <- samples$population_id >= 0
samples  <- samples[ok, , drop = FALSE]

# convert 0-based population ids to 1-based state ids, so that state k
# indexes row/column k of the cost matrix (observed ids are not re-ranked)
state_id <- samples$population_id + 1L

# georef table for gaia w/ node_id, state_id cols
sample_locations <- data.frame(
  node_id  = samples$node_id,
  state_id = state_id
)

sample_locations <- as.matrix(sample_locations[, c("node_id","state_id")])
storage.mode(sample_locations) <- "integer"
colnames(sample_locations) <- c("node_id","state_id")

In [5]:
sample_locations

node_id,state_id
0,2
1,2
2,2
3,2
4,2
5,2
6,2
7,2
8,2
9,2
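
Before passing this matrix to gaia, it can help to check that it has the shape the help page describes (0-based node ids, 1-based state ids). The sketch below is a minimal base-R check; `check_sample_locations` is a hypothetical helper, not a gaia function, and the toy matrix simply mirrors the values printed above.

```r
# Hypothetical helper (not part of gaia): basic sanity checks on the
# two-column integer matrix of sample locations.
check_sample_locations <- function(sample_locations, n_states) {
  stopifnot(
    is.matrix(sample_locations),
    is.integer(sample_locations),
    identical(colnames(sample_locations), c("node_id", "state_id")),
    all(sample_locations[, "node_id"] >= 0L),        # node ids are 0-based
    !anyDuplicated(sample_locations[, "node_id"]),   # one row per sample node
    all(sample_locations[, "state_id"] >= 1L),       # state ids are 1-based
    all(sample_locations[, "state_id"] <= n_states)  # must index the cost matrix
  )
  invisible(TRUE)
}

# Toy matrix mirroring the printed output: nodes 0..9, all in state 2
toy_locations <- cbind(node_id = 0:9, state_id = rep(2L, 10))
check_sample_locations(toy_locations, n_states = 3L)
```

A state id of 0, or one larger than the number of rows in the cost matrix, makes the check fail, which catches off-by-one mistakes early.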


## define cost matrix 

In [6]:
cost.mat = data.matrix(read.csv("distmat.csv", row.names=1))[1:3, 1:3]
dimnames(cost.mat) = NULL
cost.mat

“incomplete final line found by readTableHeader on 'distmat.csv'”


0,1,2
0.0,0.0,0.0
0.0003,0.0,0.0001
0.0003,0.0001,0.0
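
The help page for `treeseq_discrete_mpr` describes `cost_matrix` as symmetric and non-negative, and the matrix printed above has an all-zero first row, so a quick check may be worthwhile. The sketch below retypes the printed values into a standalone matrix (`cost_mat`) and tests them in base R. Mirroring the lower triangle is only one possible repair and is an assumption here, not a gaia recommendation.

```r
# Values retyped from the printed cost matrix
cost_mat <- matrix(
  c(0,      0,      0,
    0.0003, 0,      0.0001,
    0.0003, 0.0001, 0),
  nrow = 3, byrow = TRUE
)

is_square      <- nrow(cost_mat) == ncol(cost_mat)
is_nonnegative <- all(cost_mat >= 0)
is_symmetric   <- isSymmetric(cost_mat)  # FALSE: row 1 does not match column 1

# One possible repair (assumption): copy the lower triangle onto the upper one
sym_mat <- cost_mat
sym_mat[upper.tri(sym_mat)] <- t(sym_mat)[upper.tri(sym_mat)]
```

Whether the zeros in the first row are intended or come from how `distmat.csv` was written is something only the source file can answer.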


## run mpr function 

In [7]:
?treeseq_discrete_mpr

treeseq_discrete_mpr {gaia},R Documentation

ts,"A treeseq object, typically loaded via treeseq_load"
sample_locations,An integer matrix with two columns: node_id (node identifiers for sampled genomes; 0-based indexing) and state_id (geographic state assignments for samples; 1-based indexing)
cost_matrix,"A symmetric numeric matrix where entry [i,j] gives the migration cost between states i and j. Must have non-negative values. Diagonal elements (representing costs of remaining in the same state) are ignored."
use_brlen,"Logical indicating whether to scale migration costs by inverse branch lengths (TRUE) or treat all branches equally (FALSE, default)"


In [8]:
?treeseq_discrete_mpr_minimize

treeseq_discrete_mpr_minimize {gaia},R Documentation

obj,Result object from treeseq_discrete_mpr
index1,"Logical indicating whether returned state assignments should use 1-based indexing (TRUE, default) or 0-based indexing (FALSE)"


In [9]:
mpr = treeseq_discrete_mpr(ts, sample_locations, cost.mat)
estimated_node_states = treeseq_discrete_mpr_minimize(mpr) 

In [10]:
estimated_node_states ## node states for the entire tree sequence. 

None of the internal nodes is assigned to pop_0 (the source population).
This is incorrect, but expected: none of my samples (tips) came from pop_0, so parsimony has no evidence for that state.
The method applies discrete trait analysis (DTA) on a per-tree basis. For nodes that appear in multiple trees, the location with the lowest average cost is chosen.
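
The reason an unsampled population cannot be recovered can be seen with a small Sankoff-style downpass on a single tree. This is a generic base-R sketch, not gaia's implementation: the four-tip tree, the unit cost matrix, and the helper `sankoff_downpass` are all assumptions made for illustration, and a full reconstruction would also need a traceback from the root.

```r
# Toy tree (assumption): tips 1..4, internal nodes 5 = (1,2), 6 = (3,4), 7 = (5,6).
# Internal nodes are listed so that children come before their parents.
children  <- list(`5` = c(1, 2), `6` = c(3, 4), `7` = c(5, 6))
tip_state <- c(2, 2, 2, 2)              # every tip sampled in state 2
unit_cost <- matrix(1, 3, 3)            # unit migration cost (assumption)
diag(unit_cost) <- 0

sankoff_downpass <- function(children, tip_state, cost) {
  n_tips  <- length(tip_state)
  n_nodes <- n_tips + length(children)
  # score[v, s] = minimum cost of the subtree below v if v is in state s
  score <- matrix(0, n_nodes, nrow(cost))
  score[seq_len(n_tips), ] <- Inf
  score[cbind(seq_len(n_tips), tip_state)] <- 0
  for (parent in names(children)) {
    p <- as.integer(parent)
    for (child in children[[parent]]) {
      # step[s, t] = cost of moving s -> t plus the child's score in state t
      step <- sweep(cost, 2, score[child, ], `+`)
      score[p, ] <- score[p, ] + apply(step, 1, min)
    }
  }
  score
}

scores     <- sankoff_downpass(children, tip_state, unit_cost)
best_state <- apply(scores, 1, which.min)  # cheapest state per node
```

Because every tip is in state 2, placing any internal node in another state only adds migration cost, so `best_state` is 2 for every node in this toy tree.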

In [11]:
node_states_df = data.frame(node_time=nodes$time, node_state=nodes$population_id+1L, 
    estimated_node_state=estimated_node_states)

node_states_df

node_time,node_state,estimated_node_state
<dbl>,<int>,<int>
0.0,2,2
0.0,2,2
0.0,2,2
0.0,2,2
0.0,2,2
0.0,2,2
0.0,2,2
0.0,2,2
0.0,2,2
0.0,2,2
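
To see where the estimates depart from the recorded populations, agreement can be summarized separately for sample and internal nodes. The sketch below uses a small made-up data frame (`toy`) with the same columns as `node_states_df`; the values are hypothetical, and treating `node_time == 0` as "sample" is an assumption that only holds when all samples are contemporary.

```r
# Hypothetical values, same columns as node_states_df
toy <- data.frame(
  node_time            = c(0, 0, 0, 12.5, 40.2),
  node_state           = c(2L, 2L, 2L, 2L, 1L),
  estimated_node_state = c(2L, 2L, 2L, 2L, 2L)
)

toy$is_sample <- toy$node_time == 0     # assumption: samples are at time 0
toy$match     <- toy$node_state == toy$estimated_node_state

# Proportion of correctly assigned nodes, split by node type
accuracy_by_type <- tapply(toy$match, toy$is_sample, mean)

# Cross-tabulation of recorded vs. estimated states
confusion <- table(true = toy$node_state, estimated = toy$estimated_node_state)
```

Sample nodes match by construction, so the informative number is the proportion for internal nodes.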


## get per-tree mrca states

In [None]:
plot 