# Extracting results from Barido-Sottani et al. and Villandre et al. 

This Jupyter notebook (in R) details how to extract the clustering results of the multi-state birth-death (MSBD) method by Barido-Sottani et al. (2018), as well as the optimal results of the cutpoint-based methods analysed by Villandre et al. (2016).

Before running this notebook, download the simulation data (supplementary files S9-S12) published by [Villandre et al. (2016)](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148459) and save them in the ./data/ folder.

## Libraries and Inputs

In [1]:
require(ape)
require(data.table)
require(phangorn)
require(igraph)

data_dir <- './data/'

Loading required package: ape
Loading required package: data.table
Loading required package: phangorn
Loading required package: igraph

Attaching package: ‘igraph’

The following object is masked from ‘package:phangorn’:

    diversity

The following objects are masked from ‘package:ape’:

    edges, mst, ring

The following objects are masked from ‘package:stats’:

    decompose, spectrum

The following object is masked from ‘package:base’:

    union



## Parse simulation data 

The simulation data can be downloaded from the supplementary files provided by [Villandre et al. (2016)](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148459). The four files loaded below correspond to the weights 1, 0.75, 0.5 and 0.25.

In [2]:
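# All later relative paths (inputs and outputs) resolve inside data_dir
setwd(data_dir)
# get(load(...)) returns the object stored in each .RDATA file
t1 <- get(load('C_data_weight1.RDATA'))
t2 <- get(load('C_data_weight0.75.RDATA'))
t3 <- get(load('C_data_weight0.5.RDATA'))
t4 <- get(load('C_data_weight0.25.RDATA'))
# One list element per weight, in decreasing order of weight
C_simData <- list(t1, t2, t3, t4)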
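
The `get(load(...))` idiom used above relies on each file containing exactly one object: `load()` returns the names of the restored objects and `get()` looks the object up by that name. The following is a minimal sketch of the idiom on a temporary file; `toy_sim` and the private environment `env` are illustrative names that do not appear in the original scripts.

```r
# Toy stand-in for a simulation file (hypothetical content).
tmp_file <- tempfile(fileext = ".RData")
toy_sim <- list(list(a = 1), list(a = 2))
save(toy_sim, file = tmp_file)
rm(toy_sim)

# Load into a private environment so nothing in the workspace is overwritten.
env <- new.env()
obj_names <- load(tmp_file, envir = env)   # character vector of restored names

# get() expects a single name, so check that the file held one object only.
stopifnot(length(obj_names) == 1)
restored <- get(obj_names, envir = env)

length(restored)
```

Loading into a separate environment is optional; it only guards against a stored object silently replacing a variable of the same name.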

## Extract network annotations, simulation trees and cutpoint clustering results (WPGMA) from Villandre et al. (2016)

The scripts were adapted from the supplementary files provided by [Barido-Sottani et al. (2018)](http://rsif.royalsocietypublishing.org/content/15/146/20180512).

In [3]:
source("simData_to_trees.R")
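# showWarnings = FALSE keeps re-runs quiet when the folders already exist
dir.create(file.path('./', 'trees'), showWarnings = FALSE)
dir.create(file.path('./', 'wpgma'), showWarnings = FALSE)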

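# Writes one Newick tree and one WPGMA result file per simulation and
# returns a data frame mapping each tip label to its `real` cluster index.
get_trees_and_network_annotations <- function(vars,names) {
  outputdf <- data.frame(matrix(ncol=5, nrow=0))
  colnames(outputdf) <- c("d", "i", "j", "label", "real")
  for(v in seq_along(vars)) {
    
    t = vars[[v]]
    n = length(t)
    sd = names[[v]]  # dataset prefix; [[ ]] works for vectors and lists alike
    
    # i = weight index 
    for(i in seq_len(n)) {
      
      simout_file = paste0(sd,"_IS_",i,"_data_table.RData")
      simout <- get(load(simout_file))
        
      n2=length(t[[i]])
      # j = simulation index 
      for(j in seq_len(n2)) {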
        
        data = t[[i]][[j]]
        
        tree <- simout$tree[[j]]
        tree_to_print <- as.phylo(tree)
        treefname <- paste0("./trees/",sd,"_",i,"_",j,".nwk")
        write.tree(tree_to_print, file=treefname)
          
        labels = tree$tip.label
          
        # WPGMA cluster results 
        wpgma_results <- as.data.frame(data$clusters[[1]]$trueWPGMA)
        wpgma_fname <- paste0("./wpgma/",sd,"_",i,"_",j,"_wpgma.csv")
        write.csv(wpgma_results, wpgma_fname)
        
        real = data$clusterIndices[as.numeric(labels)]
        
        tmp <- data.frame(sd, i, j, labels, real)
        names(tmp) <- c("d", "i", "j", "label", "real")
        outputdf <- rbind(outputdf, tmp)
      }
    }
  }
  return (outputdf)
}

# prepare_data_tables() is provided by simData_to_trees.R (sourced above)
prepare_data_tables(list(C_simData), c("C"))
annotations = get_trees_and_network_annotations(list(C_simData), c("C"))
write.csv(annotations, 'network_annotations.csv', row.names = FALSE)
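
The `real` column is built by indexing `clusterIndices` with the tip labels converted to numbers, which assumes that every tip label is the integer position of that individual in `clusterIndices`. A small sketch with made-up values shows what this lookup does; `cluster_indices`, `tip_labels` and `toy_annotations` are hypothetical names used only for illustration.

```r
# Hypothetical toy values; the real ones come from the simulation files.
cluster_indices <- c(1, 1, 2, 2, 3)        # cluster of individuals 1..5
tip_labels <- c("4", "1", "5", "2", "3")   # tip order as stored in a tree

# Tip labels are character strings, so convert before using them as indices.
real <- cluster_indices[as.numeric(tip_labels)]

# Same column layout as network_annotations.csv.
toy_annotations <- data.frame(d = "C", i = 1, j = 1,
                              label = tip_labels, real = real)
toy_annotations
```

If a tip label were not a valid integer, `as.numeric()` would return `NA` and the corresponding `real` entry would also be `NA`, so checking `anyNA(annotations$real)` after the extraction may be a cheap safeguard.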

## Extract MSBD clustering by Barido-Sottani et al. (2018)

Here, we extract the clustering results obtained with the MSBD model of Barido-Sottani et al. (2018) on the Villandre et al. (2016) dataset. Download B_prec_lim8_ML_scores.RData and C_fast_lim8_ML_scores.RData from the supplementary data files provided by Barido-Sottani et al. (2018) and place them in the ./data/ folder. The cell below processes only the C_fast results.

In [4]:
# MSBD clustering results 
dir.create(file.path('./', 'MSBD'))

for(sd in list("C_fast")) {
  rfile=paste0(sd,'_lim8_ML_scores.RData')
  result <- get(load(rfile))
  # 4 weights x 300 simulations, one CSV file per simulation
  for(weightid in 1:4) {
    for(simid in 1:300) {
      fname=paste0("./MSBD/",sd,'_',weightid,'_',simid,'.csv')
      write.table(as.data.frame(result[[weightid]][[simid]]),file=fname, quote=FALSE,sep=",",row.names=FALSE)
    }
  }
}
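
Because the weight and simulation indices are encoded only in the file names (`<dataset>_<weightid>_<simid>.csv`), downstream analyses need to recover them from those names. The sketch below builds such an index on a throwaway folder with dummy files; the folder, the dummy column `x` and the `index` data frame are assumptions made for illustration, and the columns of the real MSBD output are not shown here.

```r
# Create a throwaway folder with dummy files that follow the naming scheme.
msbd_dir <- tempfile("MSBD_toy_")
dir.create(msbd_dir)
for (w in 1:2) {
  for (s in 1:3) {
    write.table(data.frame(x = 1),            # dummy content only
                file = file.path(msbd_dir, paste0("C_fast_", w, "_", s, ".csv")),
                quote = FALSE, sep = ",", row.names = FALSE)
  }
}

# Split each file name into dataset, weight index and simulation index.
files <- list.files(msbd_dir, pattern = "\\.csv$")
parts <- regmatches(files, regexec("^(.*)_([0-9]+)_([0-9]+)\\.csv$", files))

index <- data.frame(
  file     = files,
  dataset  = vapply(parts, `[`, character(1), 2),
  weightid = as.integer(vapply(parts, `[`, character(1), 3)),
  simid    = as.integer(vapply(parts, `[`, character(1), 4)),
  stringsAsFactors = FALSE
)
index
```

The greedy `(.*)` group keeps underscores inside the dataset name, so `C_fast` is parsed as a single prefix rather than being split at its first underscore.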