# Analysis of simulated data


TODO add here a brief description of your project or analysis. Explain the purpose, goals, or context of the notebook here.

## Table of Contents
- [1. Chapter 1 - Introduction](#chapter-1-introduction)
- [2. Chapter 2 - R packages](#chapter-2-R-packages)
- [3. Chapter 3 - Input specification](#chapter-3-Input-specification)
- [4. Chapter 4 - Supermatrix and Supertree trees VS true topologies](#chapter-4-Supermatrix-and-Supertree-trees-VS-true-topologies)
- [5. Chapter 5 - BigTree VS true topologies](#chapter-5-BigTree-VS-true-topologies)
- [6. Chapter 6 - RF distance summaries](#chapter-6-RF-distance-summaries)
    - [6.1 Subchapter 6.1 - Table 1](#subchapter-6_1-Table-1)
    - [6.2 Subchapter 6.2 - Figure 2](#subchapter-6_2-Figure-2)
- [7. Chapter 7 - Running times](#chapter-7-Running-times)
    - [7.1 Subchapter 7.1 - Figure 3](#subchapter-7_1-Figure-3)
- [8. Chapter 8 - Bootstrap analysis](#chapter-8-Bootstrap-analysis)

<a id="chapter-1-introduction"></a>

## Chapter 1 - Introduction

TODO add description of notebook and how to use it.


<a id="chapter-2-R-packages"></a>

## Chapter 2 - R packages

Here is the list of packages needed for this notebook.

In [21]:
library("phangorn")
library("ggplot2")
library("reshape2")
library("geiger")
library("adephylo")
library("phytools")
library("rlist")
library("plotrix")
library("plyr")
library("dplyr")
library("hrbrthemes")
library("rstatix")
library("gplots")
library("viridis")
library('gtools')
library(gridExtra)
library(grid)
library("ggforce")
library("scales")

TODO make a chapter for all custom functions
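
One candidate for a custom-functions chapter is the read, unroot and resolve sequence that Chapter 4 repeats for every tree file. The following is a minimal sketch, assuming the `ape` package (attached together with `phangorn`) and a hypothetical helper name, `read_resolved_trees`, that is not part of the original notebook.

```r
library(ape)

# Hypothetical helper (not in the original notebook): read a multi-tree Newick
# file, unroot every tree and resolve any polytomy with multi2di().
read_resolved_trees <- function(path) {
  trees <- read.tree(path, keep.multi = TRUE)
  trees <- unroot(trees)
  for (i in which(!is.binary(trees))) {
    trees[[i]] <- multi2di(trees[[i]])
  }
  trees
}

# Toy usage with a temporary file holding one binary tree and one star tree
tmp <- tempfile(fileext = ".nwk")
writeLines(c("((A,B),(C,D));", "(A,B,C,D);"), tmp)
toy_trees <- read_resolved_trees(tmp)
print(is.binary(toy_trees))
```

With such a helper, each of the four tree files in Chapter 4 could be loaded with a single call, although the explicit version used there may be easier to debug.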

<a id="chapter-3-Input-specification"></a>

## Chapter 3 - Input specification

Set the `parent_of_all_dir` variable to the directory that contains the input; all input files are expected to be under it. Alternatively, change each individual input path to your liking.<br>
All input file and directory paths are listed here.

In [2]:
# the location of the directory that contains all inputs for analyses 
parent_of_all_dir = "/users/cn/abaltzis/projects/1_paralogs/simulated_datasets/simulated_sequences/"

# The directory (or directories) containing all the results of the main.nf pipeline run. It can be a glob pattern.
dir_containing_all_pipe_results = "../results" # TODO remove this once the analyses are ready
#dir_containing_all_pipe_results = paste0(parent_of_all_dir, "rerun_25_species_corrected_branches/rerun_sim_b_factor*")

#  TODO describe what this file does
reference_tree_codefile = file.path(parent_of_all_dir, "ref_tree.code")

# get the species/organism list of names used by the main.nf pipeline
species_name_file = file.path(parent_of_all_dir, "orthologs_org_ids_to_concatenate")


# Each run of main.nf has one output directory. Here we collect that directory, or the output directories of multiple runs.
run_dirs = Sys.glob(dir_containing_all_pipe_results)
run_dirs = run_dirs[file.info(run_dirs)$isdir]

# Create the 'figures' directory if it doesn't exist. Created where this notebook is executed.
if (!dir.exists("figures")) {
    dir.create("figures")
}

# Create the 'tables' directory if it doesn't exist. Created where this notebook is executed.
if (!dir.exists("tables")) {
    dir.create("tables")
}
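
Since every later cell depends on these paths, it can help to check them before running the analysis. The following is a minimal sketch in base R; the helper name `missing_inputs` is hypothetical and not part of the original notebook.

```r
# Hypothetical helper (not in the original notebook): report which of the
# expected input paths are missing before any analysis cell is run.
missing_inputs <- function(paths) {
  paths[!file.exists(paths)]
}

# Toy usage: one path that exists (a temporary file) and one that does not
existing <- tempfile()
file.create(existing)
absent <- file.path(tempdir(), "no_such_input_file")
print(missing_inputs(c(existing, absent)))
```

In this notebook the helper would presumably be called with `c(reference_tree_codefile, species_name_file, run_dirs)`; an empty result means all inputs were found.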

<a id="chapter-4-Supermatrix-and-Supertree-trees-VS-true-topologies"></a>

## Chapter 4 - Supermatrix and Supertree trees VS true topologies

TODO add description 

In [11]:
# set up the global lists that will hold all family-level values
RF_avg_list = list()
collect_branch_len_supermatrix_ml = c()
collect_diameter_supermatrix_ml = c()

for (run in run_dirs) {
    print(basename(run))

    # the mapping used later on to rename species trees according to a more generic sequence name
    ref_codefile = read.table(file = reference_tree_codefile )
    
    # get all the family subdirectories in each main.nf output run dir
    fam_dirs = grep("avg_branchlen_0.7_protein_family*", list.dirs(run, recursive = FALSE), value = TRUE)
    count_fam = 0

    # this will contain the mean branch length for all true/reference paralog trees for each gene family
    avg_branch_length = c()

    # read the names of the ortholog species.
    orgs = read.table(file = species_name_file)

    # get all unit trees for all combinations of methods
    for (m in fam_dirs) {
        
        count_fam = count_fam + 1

        # print a progress check to screen every 10 families
        if (count_fam %% 10 == 0) {
            print(m)
        }

        # Get all unit trees (250 = 25 organisms * 10 samples/statistical repeats) for the combination Minimum Evolution (ME) + supermatrix method (SM).
        # We unroot every tree and fully resolve it so that RF.dist and the other functions can be used.
        all_units_file  = file.path(m, "results/rerun", "all_units.nwk")
        all_units_trees = read.tree(all_units_file, keep.multi=TRUE)
        all_units_trees = unroot(all_units_trees)
        
        # check whether there are any non-binary trees (e.g. consensus trees) among the 250 trees in the file above
        if (length(which(is.binary(all_units_trees) == FALSE)) != 0) {
            
            # if there are non-binary trees, iterate through them and resolve each one into a binary tree with multi2di
            for (tree in which(is.binary(all_units_trees) == FALSE)) {
                all_units_trees[[tree]] = multi2di(all_units_trees[[tree]])
            }
        }

        # Get all units trees (250) ME + supertree (ST) method and handle non-binary trees as above
        all_superfine_file  = file.path(m, "results/rerun", "all_units_superfine_trees.nwk")
        all_superfine_trees = read.tree(all_superfine_file, keep.multi=TRUE)
        all_superfine_trees = unroot(all_superfine_trees)
        if (length(which(is.binary(all_superfine_trees) == FALSE)) != 0) {
            for (tree in which(is.binary(all_superfine_trees) == FALSE)) {
                all_superfine_trees[[tree]] = multi2di(all_superfine_trees[[tree]])
            }
        }

        # Get all units trees (250) Maximum Likelihood (ML) + SM 
        all_units_raxml_file  = file.path(m, "results/rerun", "all_units_RAxML_trees.nwk")
        all_units_raxml_trees = read.tree(all_units_raxml_file, keep.multi=TRUE)
        all_units_raxml_trees = unroot(all_units_raxml_trees)
        if (length(which(is.binary(all_units_raxml_trees) == FALSE)) != 0) {
            for (tree in which(is.binary(all_units_raxml_trees) == FALSE)) {
                all_units_raxml_trees[[tree]] = multi2di(all_units_raxml_trees[[tree]])
            }
        }

        # take only the last 10 of the 250 trees above (the 10 samples/replicates of unit25, i.e. all 25 single units concatenated) and collect the branch lengths (27 per tree, 2*15-3) and the diameter (maximum patristic distance between tips)
        for(mytree in 241:length(all_units_raxml_trees)) {
            collect_branch_len_supermatrix_ml = c(collect_branch_len_supermatrix_ml, all_units_raxml_trees[[mytree]]$edge.length)
            collect_diameter_supermatrix_ml = c(collect_diameter_supermatrix_ml, max(distTips(all_units_raxml_trees[[mytree]], method = 'patristic')))
        }

        # Get all units trees (250) ML + ST
        all_raxml_superfine_file  = file.path(m, "results/rerun", "all_units_RAxML_superfine_trees.nwk")
        all_raxml_superfine_trees = read.tree(all_raxml_superfine_file, keep.multi=TRUE)
        all_raxml_superfine_trees = unroot(all_raxml_superfine_trees)
        if (length(which(is.binary(all_raxml_superfine_trees) == FALSE)) != 0) {
            for (tree in which(is.binary(all_raxml_superfine_trees) == FALSE)) {
                all_raxml_superfine_trees[[tree]] = multi2di(all_raxml_superfine_trees[[tree]])
            }
        }

        # Reference tree block: each gene family has its own reference tree, the one generated by the simulation step. All sequences are generated along this tree, then aligned and passed as input to the main.nf pipeline.
        ref_tree_file = file.path(run, paste0(basename(m), ".nwk"))
        ref_tree      = read.tree(ref_tree_file, keep.multi=FALSE)

        # Each gene family has a predefined gene tree (ref_tree), built by design by attaching a fixed species tree to each leaf of the paralog tree. Gene families differ only in their paralog tree, which is unique to each family (15 tips).
        # Here we extract the paralog tree by extracting the subtree that contains all the sequences of one species. Since every species has the same paralog tree, it does not matter which species we pick.
        ref_subtree   = keep.tip(ref_tree, as.character(ref_codefile$V1))

        # we then rename the tips of the paralog tree (ref_subtree) to a more generic keyname. Effectively going from rat1 to seq1. 
        for (j in 1:length(ref_subtree$tip.label)) { 
            ref_subtree$tip.label[j] = as.character(ref_codefile$V2[which(ref_codefile$V1 == ref_subtree[["tip.label"]][j])])
        }

        # Unroot the tree, wrap it in a multiPhylo list and define the path where the paralog tree can be saved in the specific gene family directory
        ref_subtree        = unroot(ref_subtree)
        ref_subtree        = list(ref_subtree)
        class(ref_subtree) = "multiPhylo"
        subtree_out_file   = file.path(m, "results/rerun", paste0("ref_subtree_", basename(m), ".nwk"))
        # TODO remove this comment write.tree(ref_subtree, file=subtree_out_file)

        # append the mean branch length of the true/reference paralog tree of this gene family
        avg_branch_length = c(avg_branch_length, mean(ref_subtree[[1]]$edge.length))

        # Make a list containing all trees gathered so far and compute all pairwise comparisons among them. The result is a square 1001 x 1001 matrix with 0 on the diagonal, i.e. 1001*1001-1001 off-diagonal entries (1001*1000/2 = 500500 unique pairs).
        # All other measurements are taken from this symmetric distance matrix (plotdata).
        all_trees        = c(ref_subtree, all_units_trees, all_superfine_trees, all_units_raxml_trees, all_raxml_superfine_trees)
        class(all_trees) = "multiPhylo"
        all_vs_all_rf    = RF.dist(all_trees, normalize = TRUE)
        plotdata         = data.matrix(all_vs_all_rf)

        # all ME metrics to be computed (superfine = SuperTree)
        mean_rf           = c()
        mean_superfine_rf = c()

        # all ML metrics to be computed
        mean_raxml_rf           = c()
        mean_raxml_superfine_rf = c()

        # Walk along the first row of the plotdata matrix, starting at column 2 and stepping by 10 up to column 242, i.e. in chunks of 10 (samples), and compute the mean of each chunk.
        # This averages all sample values within a unit, going from 250 values to 25 averages.
        # It only covers the comparisons between the reference paralog tree (ref_subtree, first row/column of the matrix) and the 250 paralog trees of ME + SM (all_units_trees).
        for (i in seq(2, (length(orgs$V1)*10)+1, by=10)) {
            mean_rf = c(mean_rf, mean(plotdata[1, i:(i+9)]))
        }

        # This loop instead uses chunk starts 252 to 492 (columns 252 to 501 of the first row of plotdata), still in chunks of 10 (samples).
        # It does the same as above, but between the reference paralog tree and the 250 paralog trees of ME + ST (all_superfine_trees).
        for (i in seq( (length(orgs$V1)*10)+2, (2*(length(orgs$V1)*10)+1), by=10 )) {
            mean_superfine_rf = c(mean_superfine_rf, mean(plotdata[1,i:(i+9)]))
        }

        # This loop uses chunk starts 502 to 742 (columns 502 to 751), still in chunks of 10 (samples).
        # It does the same as above, but between the reference paralog tree and the 250 paralog trees of ML + SM (all_units_raxml_trees).
        for (i in seq( (2*(length(orgs$V1)*10))+2, (3*(length(orgs$V1)*10)+1), by=10 )) {
            mean_raxml_rf = c(mean_raxml_rf, mean(plotdata[1,i:(i+9)]))
        }

        # This loop uses chunk starts 752 to 992 (columns 752 to 1001), still in chunks of 10 (samples).
        # It does the same as above, but between the reference paralog tree and the 250 paralog trees of ML + ST (all_raxml_superfine_trees).
        for (i in seq( (3*(length(orgs$V1)*10))+2, (4*(length(orgs$V1)*10)+1), by=10)) {
            mean_raxml_superfine_rf = c(mean_raxml_superfine_rf, mean(plotdata[1,i:(i+9)]))
        }

        # collect all mean RF values for this family, one per unit: 25 mean_rf, 25 mean_superfine_rf, etc.
        collection_mean_rf      = c(mean_rf, mean_superfine_rf, mean_raxml_rf, mean_raxml_superfine_rf)
  
        # Finally update the "global" lists created at the beginning
        RF_avg_list      = list.append(RF_avg_list, collection_mean_rf)

    }
}



[1] "rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_11th_17th_20th_22th_and_23th_50_fams_25_species"
[1] "/users/cn/abaltzis/projects/1_paralogs/simulated_datasets/simulated_sequences/rerun_25_species_corrected_branches/rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_11th_17th_20th_22th_and_23th_50_fams_25_species/avg_branchlen_0.7_protein_family_010.sub"
[1] "/users/cn/abaltzis/projects/1_paralogs/simulated_datasets/simulated_sequences/rerun_25_species_corrected_branches/rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_11th_17th_20th_22th_and_23th_50_fams_25_species/avg_branchlen_0.7_protein_family_020.sub"
[1] "/users/cn/abaltzis/projects/1_paralogs/simulated_datasets/simulated_sequences/rerun_25_species_corrected_branches/rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_11th_17th_20th_22th_and_23th_50_fams_25_species/avg_branchlen_0.7_protein_family_030.sub"
[1] "/users/cn/abaltzis/projects/1_paralogs/simulated_datasets/simulated_sequences/rerun_25_species_corrected_branches/rerun_sim_b_fact
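
The four chunked loops in the cell above all apply the same operation to the first row of `plotdata`: average consecutive blocks of 10 samples. A minimal sketch of that indexing in base R, using random numbers as a stand-in for the real RF distances (all values here are hypothetical):

```r
# Toy stand-in for the first row of plotdata: 1 self-distance followed by
# 4 methods * 25 units * 10 samples = 1000 values (hypothetical numbers).
n_units   <- 25
n_samples <- 10
methods   <- c("ME_SM", "ME_ST", "ML_SM", "ML_ST")
set.seed(1)
first_row <- c(0, runif(length(methods) * n_units * n_samples))

# Drop the self-comparison, then fill a matrix column by column so that each
# column holds the 10 samples of one unit; colMeans() then gives the 100 means.
chunk_means <- colMeans(matrix(first_row[-1], nrow = n_samples))

# Same result as the explicit loop for the first ME + SM chunk (columns 2 to 11)
print(all.equal(chunk_means[1], mean(first_row[2:11])))
```

The resulting vector follows the same order as `collection_mean_rf` (ME + SM, ME + ST, ML + SM, ML + ST), provided the trees were concatenated in that order.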

In [12]:
# transform the list of vectors (RF_avg_list) into a dataframe of size 500 x 100 (500 families; 25 units for each combination of ME/ML and SM/ST)
RF_avg_as_df = ldply(RF_avg_list, recursive=FALSE)

# create the column names for the dataframe: ME = Minimum Evolution, ML = Maximum Likelihood, SM = SuperMatrix, ST = SuperTree. Names will look like ME_SM_unit_1
units_names = c()
for (i in 1:length(orgs$V1)) {
    units_names = c(units_names, paste("ME_SM", "unit", i, sep="_"), 
                    paste("ME_ST", "unit", i, sep="_"), 
                    paste("ML_SM", "unit", i, sep="_"), 
                    paste("ML_ST", "unit", i, sep="_"))
}

# sort them in natural (human-readable) order, so that the order is ME_SM unit1...unit25, ME_ST unit1...unit25, then ML_SM and ML_ST
units_names = mixedsort(units_names)

# TODO see if the standard error (se) was actually useful
RF_df = round(RF_avg_as_df, digits=3)

# add the units name generated before as column names for the dataframe and save it to file
colnames(RF_df) = units_names
write.table(RF_df, file = "tables/avg_RF_between_ref_supermatrix_and_supertree.tsv", quote = FALSE, row.names = FALSE)
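
The column names must follow the same order as the values in `collection_mean_rf` (all 25 units of ME_SM, then ME_ST, ML_SM and ML_ST). As an alternative sketch to building the names interleaved and sorting them afterwards, the same order can be produced directly with `rep()`; `n_units` is assumed here to equal `length(orgs$V1)`.

```r
methods <- c("ME_SM", "ME_ST", "ML_SM", "ML_ST")
n_units <- 25  # assumed to equal length(orgs$V1)

# Repeat each method 25 times and cycle the unit index 1..25 once per method
units_names <- paste(rep(methods, each = n_units), "unit",
                     rep(seq_len(n_units), times = length(methods)), sep = "_")
print(head(units_names, 3))
print(units_names[c(25, 26, 100)])
```

This variant does not depend on a natural sort, which makes the link between name order and value order explicit.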

<a id="chapter-5-BigTree-VS-true-topologies"></a>

## Chapter 5 - BigTree VS true topologies

TODO add description

In [37]:
# Trees from BigTree (full_aln) ME/ML (paralog tree) VS true topologies 

paralog_ME_vs_true_tree = c()
paralog_ML_vs_true_tree = c()
RF_matrices = list()


collect_branch_len_sub_from_ref = c()
collect_diameter_sub_from_ref = c()

for (run in run_dirs) {
    print(basename(run))

    # the mapping used later on to rename species trees according to a more generic sequence name
    ref_codefile = read.table(file = reference_tree_codefile )
    
    # get all the family subdirectories in each main.nf output run dir
    fam_dirs = grep("avg_branchlen_0.7_protein_family*", list.dirs(run, recursive = FALSE), value = TRUE)
    count_fam = 0

    # this will contain the mean branch length for all true/reference paralog trees for each gene family
    avg_branch_length = c()

    # read the names of the ortholog species.
    orgs = read.table(file = species_name_file)

    # extract and work on all BigTree paralog files for all families. These trees are computed by the main.nf pipeline in simulation mode.
    for (m in fam_dirs) {
        
        count_fam = count_fam + 1

        # print a progress check to screen every 10 families
        if (count_fam %% 10 == 0) {
            print(m)
        }

        # read the paralog tree obtained from the BigTree (gene family tree) built using ME.
        # the paralog tree was obtained by extracting each species subtree from the gene family tree; these 25 subtrees were then merged into one, exactly as is done for SuperTree (using superfine).
        fastme_paralog_tree_file          = file.path(m, "results/rerun", paste0(basename(m), "_full_aln_coded.phylip_fastme_tree_paralog_tree.nwk"))
        fastme_paralog_tree_from_full_aln = read.tree(fastme_paralog_tree_file, keep.multi=FALSE)
        fastme_paralog_tree_from_full_aln = unroot(fastme_paralog_tree_from_full_aln)
        fastme_paralog_tree_from_full_aln = list(fastme_paralog_tree_from_full_aln)


        # read the paralog tree obtained from the BigTree (gene family tree) built using ML. The paralog tree is constructed the same way as above.
        raxml_paralog_tree_file           = file.path(m, "results/rerun", paste0("RAxML_bestTree.", basename(m), "_full_aln_coded_raxml_paralog_tree.nwk"))   
        raxml_paralog_tree_from_full_aln  = read.tree(raxml_paralog_tree_file, keep.multi=FALSE)
        raxml_paralog_tree_from_full_aln  = unroot(raxml_paralog_tree_from_full_aln)
        raxml_paralog_tree_from_full_aln  = list(raxml_paralog_tree_from_full_aln)
                
        # read the reference tree (gene family tree), the one set as truth and along which the sequences were generated.
        ref_tree_file = file.path(run, paste0(basename(m), ".nwk"))
        ref_tree      = read.tree(ref_tree_file, keep.multi=FALSE)

        # the reference paralog trees of all species are identical in topology (by design) but differ in branch lengths. Here we retrieve some information on that.
        for (species in orgs$V1) {
            
            # extract the tips as before, then extract the subtree and unroot it.
            # this is the reference paralog tree for each species; within one family all species share the same paralog tree topology, and only the branch lengths differ between species.
            tips_to_keep = grep(paste0("^", species), ref_tree$tip.label, perl = TRUE)
            test_subtree = keep.tip(ref_tree, tips_to_keep)
            test_subtree = unroot(test_subtree)

            # get branch length and diameter
            collect_branch_len_sub_from_ref = c(collect_branch_len_sub_from_ref, test_subtree$edge.length)
            collect_diameter_sub_from_ref   = c(collect_diameter_sub_from_ref, max(distTips(test_subtree, method = 'patristic')))
        }

        # Each gene family has a predefined gene tree (ref_tree), built by design by attaching a fixed species tree to each leaf of the paralog tree. Gene families differ only in their paralog tree, which is unique to each family (15 tips).
        # Here we extract the paralog tree by extracting the subtree that contains all the sequences of one species. Since every species has the same paralog tree, it does not matter which species we pick.
        ref_subtree   = keep.tip(ref_tree, as.character(ref_codefile$V1))

        # we then rename the tips of the paralog tree (ref_subtree) to a more generic keyname. Effectively going from rat1 to seq1. 
        for (j in 1:length(ref_subtree$tip.label)) { 
            ref_subtree$tip.label[j] = as.character(ref_codefile$V2[which(ref_codefile$V1 == ref_subtree[["tip.label"]][j])])
        }

        # Unroot the tree and put into a list
        ref_subtree        = unroot(ref_subtree)
        ref_subtree        = list(ref_subtree)


        # append the mean branch length of the true/reference paralog tree of this gene family
        avg_branch_length = c(avg_branch_length, mean(ref_subtree[[1]]$edge.length))
        
        # put the two BigTree (ME/ML) paralog trees and the reference paralog tree into one list.
        all_trees        = c(ref_subtree, fastme_paralog_tree_from_full_aln, raxml_paralog_tree_from_full_aln)
        class(all_trees) = "multiPhylo"
        
        # compute the RF distance between the reference paralog tree and the two other paralog trees (ME and ML). In practice, all pairwise distances among the three trees are computed.
        all_vs_all_rf = RF.dist(all_trees, normalize=TRUE)
        plotdata      = data.matrix(all_vs_all_rf)
        RF_matrices   = list.append(RF_matrices, plotdata)
        
        # keep the RF distance of each BigTree paralog tree to the reference paralog tree: first row, column 2 (ME) and column 3 (ML). Column 1 is skipped because it holds the RF between the reference tree and itself.
        paralog_ME_vs_true_tree = c(paralog_ME_vs_true_tree, plotdata[1, 2])
        paralog_ML_vs_true_tree = c(paralog_ML_vs_true_tree, plotdata[1, 3])
    }
}

write.table(paralog_ME_vs_true_tree, file = "tables/paralog_ME_vs_true_tree.tsv", quote=FALSE, row.names = FALSE, col.names = FALSE)
write.table(paralog_ML_vs_true_tree, file = "tables/paralog_ML_vs_true_tree.tsv", quote=FALSE, row.names = FALSE, col.names = FALSE)

[1] "results"
[1] "../results/avg_branchlen_0.7_protein_family_100.sub"
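
The per-species subtrees in the cell above are selected by prefix-matching tip labels with `grep(paste0("^", species), ...)`. A prefix match also selects labels of any other species whose name starts with the same characters. Whether that can happen here depends on the names in `orthologs_org_ids_to_concatenate`, which this notebook does not show. The following sketch uses hypothetical labels and assumes the label scheme `<species><digits>` (as in `rat1`).

```r
# Hypothetical tip labels; the notebook's real labels may follow another scheme.
tip_labels <- c("rat1", "rat2", "rattus1", "mouse1", "mouse2")
species    <- "rat"

# Prefix match as in the cell above: also picks up "rattus1"
prefix_hits <- grep(paste0("^", species), tip_labels, value = TRUE)

# Stricter match, assuming labels are exactly <species><digits>
strict_hits <- grep(paste0("^", species, "[0-9]+$"), tip_labels, value = TRUE)

print(prefix_hits)
print(strict_hits)
```

If no species name is a prefix of another, both patterns select the same tips and the simpler one is sufficient.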


<a id='chapter-6-RF-distance-summaries'></a>

## Chapter 6 - RF distance summaries

todo add description

<a id='subchapter-6_1-Table-1'></a>

### Subchapter 6.1 - Table 1
todo add description.

In [6]:
# read the table compiled in chapter 4. So this cell does not depend on that one to run, it just needs the table file.
df_new = read.table("tables/avg_RF_between_ref_supermatrix_and_supertree.tsv", header=T)

# compute the mean and sd of each column, giving two named vectors of size 100 filled with averages of averages, with names like ML_SM_unit_9
rf_df_new_means = colMeans(df_new)
sd_df_new_means = apply(df_new, 2, sd)

# Define the method names and the number of units
methods = c("ME_SM", "ME_ST", "ML_SM",  "ML_ST")
units   = 25

# Create an empty dataframe with row names for methods and columns for unit_1 to unit_25
avg_rf_df_method_unit           = data.frame(Method = methods, matrix(ncol = units, nrow = 4))
colnames(avg_rf_df_method_unit) = c("Method", paste0("unit_", 1:units))

# Fill the dataframe using the keys of the named vector of rf_df_new_means, these names are the same for the vector sd_df_new_means too.
for (key in names(rf_df_new_means)) {

  # Split the key into parts (method and unit)
  parts  = strsplit(key, "_unit_")[[1]]
  method = parts[1]
  unit   = as.numeric(parts[2])

  # Place the value in the correct row and column of the dataframe, appending the standard deviation to it.
  avg_rf_df_method_unit[avg_rf_df_method_unit$Method == method, paste0("unit_", unit)] = paste0(round(rf_df_new_means[key], digits=2), "  sd  ", round(sd_df_new_means[key], digits=2))
}

# create the actual table with only the unit_25 and unit_1 values from the df above, plus their difference (unit_1 - unit_25)
diff   = c(as.numeric(sub(" sd.*", "", avg_rf_df_method_unit$unit_1)) - as.numeric(sub(" sd.*", "", avg_rf_df_method_unit$unit_25)))
table1 = data.frame(Method = avg_rf_df_method_unit$Method,
                    "Unit 25"    = avg_rf_df_method_unit$unit_25,
                    "Unit 1"     = avg_rf_df_method_unit$unit_1,
                    "Difference" = diff )

# read the table compiled in chapter 5. So this cell does not depend on that one to run, it just needs the table file.
df_bigtree_ME = read.table("tables/paralog_ME_vs_true_tree.tsv", header=F)
df_bigtree_ML = read.table("tables/paralog_ML_vs_true_tree.tsv", header=F)

# average the BigTree RF distances (ME and ML) to their respective true paralog tree over all families, going from 500 values to 1.
rf_df_bigtree_ME_mean = colMeans(df_bigtree_ME)
rf_df_bigtree_ME_sd   = apply(df_bigtree_ME, 2, sd)
rf_df_bigtree_ML_mean = colMeans(df_bigtree_ML)
rf_df_bigtree_ML_sd   = apply(df_bigtree_ML, 2, sd)

# add the above four values to table1
bigtree_ME_row = c("ME_BigTree", paste0(round(rf_df_bigtree_ME_mean, digits = 2), "  sd  ", round(rf_df_bigtree_ME_sd, digits = 2)), "Na", "Na")
bigtree_ML_row = c("ML_BigTree", paste0(round(rf_df_bigtree_ML_mean, digits = 2), "  sd  ", round(rf_df_bigtree_ML_sd, digits = 2)), "Na", "Na")
table1      = rbind(table1, bigtree_ME_row)
table1      = rbind(table1, bigtree_ML_row)

table1
write.table(table1, file="tables/Table1_average_RF.tsv", quote=FALSE, sep="\t", row.names=FALSE)

| Method | Unit.25 | Unit.1 | Difference |
|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| ME_SM | 0.24 sd 0.15 | 0.47 sd 0.13 | 0.23 |
| ME_ST | 0.23 sd 0.13 | 0.47 sd 0.13 | 0.24 |
| ML_SM | 0.19 sd 0.14 | 0.44 sd 0.13 | 0.25 |
| ML_ST | 0.2 sd 0.12 | 0.44 sd 0.13 | 0.24 |
| ME_BigTree | 0.25 sd 0.16 | Na | Na |
| ML_BigTree | 0.18 sd 0.15 | Na | Na |
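
The table entries are built with `paste0(round(...))`, which drops trailing zeros; this is why one cell reads `0.2 sd 0.12` while the others show two decimals. A minimal sketch of the difference, with hypothetical values, using `sprintf()` for a fixed number of decimals:

```r
# Hypothetical means and standard deviations, for illustration only
means <- c(0.2, 0.187)
sds   <- c(0.123, 0.14)

# paste0(round(...)) drops trailing zeros, sprintf("%.2f") keeps them
with_round   <- paste0(round(means, digits = 2), "  sd  ", round(sds, digits = 2))
with_sprintf <- sprintf("%.2f  sd  %.2f", means, sds)

print(with_round)
print(with_sprintf)
```

Either form works for the table; `sprintf()` only makes the column widths uniform.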


<a id="subchapter-6_2-Figure-2"></a>

### Subchapter 6.2 - Figure 2

TODO add description

In [11]:
# Set plot dimensions and resolution for inline display
options(repr.plot.width = 6, repr.plot.height = 5, repr.plot.res = 300)

# Load the necessary data
# Read the RF values between the BigTree (ME/ML) paralog trees and the reference paralog tree, written to file in chapter 5, and average them.
df_bigtree_ME = read.table("tables/paralog_ME_vs_true_tree.tsv", header=F)
df_bigtree_ML = read.table("tables/paralog_ML_vs_true_tree.tsv", header=F)
rf_df_bigtree_ME_mean = colMeans(df_bigtree_ME)
rf_df_bigtree_ML_mean = colMeans(df_bigtree_ML)

# Read the average RF values between the reference and the supermatrix/supertree trees (table written in chapter 4)
df_new = read.table("tables/avg_RF_between_ref_supermatrix_and_supertree.tsv", header = TRUE)

# Calculate column-wise means of the loaded data
rf_df_new_means = colMeans(df_new)

# Reshape the means data into a long format
rf_df_new_means_melted = melt(rf_df_new_means)

# Rename the first column to "RF" for clarity
colnames(rf_df_new_means_melted)[1] = "RF"

# Add a column indicating the topologies (25 levels, repeated 4 times for each method)
rf_df_new_means_melted$Topology = c(rep(colnames(df_new)[1:25], 4))

# Add a column specifying the method (Supermatrix-ME, Supertree-ME, Supermatrix-ML, Supertree-ML)
rf_df_new_means_melted$Method = c(rep("Supermatrix-ME", 25), rep("Supertree-ME", 25),
                                  rep("Supermatrix-ML", 25), rep("Supertree-ML", 25))

# Sort the data by topology using mixedorder
rf_df_new_means_melted = rf_df_new_means_melted[mixedorder(rf_df_new_means_melted$Topology), ]

# create the x-axis label names
x_axis_labels = gsub("ME_SM_", "", colnames(df_new)[1:25])

# Modify the Topology column to match the cleaned x-axis labels
rf_df_new_means_melted$Topology = rep(x_axis_labels, each = 4)  # Repeat each value per method (4 times)

# Define custom colors for each Method
custom_colors = c("Supertree-ME" = "#7CB6E2",     # Light blue 
                  "Supermatrix-ME" = "#2F739B",   # Dark blue
                  "Supertree-ML" = "#E5A23D",     # Orange/yellow
                  "Supermatrix-ML" = "#D05159")   # Red

# Create the line plot
p = ggplot(rf_df_new_means_melted, aes(x = factor(Topology, levels = x_axis_labels), 
                                       y = RF, group = Method, color = Method)) + 
    geom_line() +  # Add lines connecting the points by method
    geom_point(shape = 16) +  # Use shape 16 (solid dot) for all points

    # Add a horizontal dashed line representing the average RF of ME big tree
    geom_hline(yintercept = rf_df_bigtree_ME_mean, 
               linetype = "dashed", color = "#606060") +
    
    # Annotate the plot with a label for the average bigtree ME RF line
    annotate("text", x = 18, y = rf_df_bigtree_ME_mean + 0.025, 
             label = "Avg RF between BigTree-ME and true topology", 
             size = 3, color = "#606060") +

    # Add a horizontal dashed line representing the average RF of ML big tree
    geom_hline(yintercept = rf_df_bigtree_ML_mean, 
               linetype = "dashed", color = "#1E7005") +
    
    # Annotate the plot with a label for the average bigtree ML RF line
    annotate("text", x = 12, y = rf_df_bigtree_ML_mean - 0.025, 
             label = "Avg RF between BigTree-ML and true topology", 
             size = 3, color = "#1E7005") +

    # Apply a light theme for better readability
    theme_light() + 

    # Adjust x-axis text to be tilted at 45 degrees and align to the right
    theme(axis.text.x = element_text(angle = 45, hjust = 1), 
          legend.position = "right",
          axis.title.x = element_text(face = "bold"),  
          axis.title.y = element_text(face = "bold"))  + 
    
    # Label the axes
    ylab("Normalized RF") + 
    xlab("Topology") + 

    # Set y-axis limits between 0 and 0.5
    scale_y_continuous(limits = c(0, 0.5)) +

    # Customize color and shape scales for the different methods
    scale_color_discrete(breaks = c("Supertree-ME", "Supermatrix-ME", "Supertree-ML", "Supermatrix-ML")) + 
    scale_shape_discrete(breaks = c("Supertree-ME", "Supermatrix-ME", "Supertree-ML", "Supermatrix-ML")) +
    scale_color_manual(values = custom_colors)  # Apply custom colors

# Save the plot to a file with high resolution
ggsave(filename = "figures/RF_lineplot_figure2.png", plot = p, dpi = "retina")

[1m[22mScale for [32mcolour[39m is already present.
Adding another scale for [32mcolour[39m, which will replace the existing scale.
[1m[22mSaving 7 x 7 in image


<a id="chapter-7-Running-times"></a>

## Chapter 7 - Running times

TODO add description 

In [6]:
# read the trace files for running times
for (run in run_dirs) {
    print(basename(run))

    # each main.nf output run has a trace file reporting various information on the run itself
    tracefile_path = file.path(run, "trace_cpu_time.txt")
    tracefile      = read.table(tracefile_path, header = TRUE)
    
    # here we parse the trace file to retain only the processes that we want to take the running time (realtime) from
    tracefile_mod  = tracefile[grep("simulated_data:run_phylo_ML_supermatrix_aln_sim|simulated_data:only_concatenate_aln_sim|simulated_data:run_phylo_ML_full_aln_sim", tracefile$native_id), c("name", "native_id", "realtime")]
    
    # reshape the table above so that the process names (simulated_data:run_phylo_ML_supermatrix_aln_sim etc.) become column names;
    # the rows are still the family names (working as IDs) and each column holds the running times of that process for the given family.
    tracefile_mod_wide = spread(tracefile_mod, native_id, realtime)

    # create a new last column called Supermatrix_ML holding the running time of concatenation plus ML tree computation (on the concatenated MSA)
    tracefile_mod_wide$Supermatrix_ML = as.numeric(tracefile_mod_wide$`simulated_data:only_concatenate_aln_sim`) + as.numeric(tracefile_mod_wide$`simulated_data:run_phylo_ML_supermatrix_aln_sim`)

    # retain only the family name, the BigTree ML running time and the SM + ML running time
    tracefile_mod_wide = tracefile_mod_wide[,c(1,3,5)]

    # call the second column name Big_Tree and set their value to numeric type
    colnames(tracefile_mod_wide)[2] = "Big_Tree"
    tracefile_mod_wide$Big_Tree = as.numeric(tracefile_mod_wide$Big_Tree)

    # add a column that groups families by the main.nf run they belong to
    tracefile_mod_wide$family_group = rep(basename(run), nrow(tracefile_mod_wide))

    # add tables together row-wise
    if(run == run_dirs[1]) {
        fin_tracefile_mod_wide = tracefile_mod_wide
    } else {
        fin_tracefile_mod_wide = rbind(fin_tracefile_mod_wide, tracefile_mod_wide)
    }
}

[1] "rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_11th_17th_20th_22th_and_23th_50_fams_25_species"
[1] "rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_17th_20th_22th_and_23th_50_fams_25_species"
[1] "rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_20th_22th_and_23th_50_fams_25_species"
[1] "rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_22th_and_23th_50_fams_25_species"
[1] "rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_23th_50_fams_25_species"
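
The reshaping done inside the loop above can be tried on a toy table. The sketch below is a base-R stand-in for `spread()`, not the notebook's own code: the family names, the shortened process names and the timings are made up for illustration, and columns are selected by name rather than by position.

```r
# Toy long-format trace table: one row per (family, process) pair, times in ms.
# All names and values here are hypothetical.
trace_long <- data.frame(
  name      = rep(c("fam_001", "fam_002"), each = 3),
  native_id = rep(c("concatenate", "ml_supermatrix", "ml_full"), times = 2),
  realtime  = c(100, 2000, 9000, 150, 2500, 12000)
)

# Long -> wide: one row per family, one column per process.
trace_wide <- reshape(trace_long, idvar = "name", timevar = "native_id",
                      direction = "wide")
names(trace_wide) <- sub("^realtime\\.", "", names(trace_wide))

# Supermatrix + ML time is concatenation plus ML on the concatenated MSA.
trace_wide$Supermatrix_ML <- trace_wide$concatenate + trace_wide$ml_supermatrix
trace_wide$Big_Tree       <- trace_wide$ml_full

# Selecting by name keeps working if the column order changes.
trace_wide <- trace_wide[, c("name", "Big_Tree", "Supermatrix_ML")]
print(trace_wide)
```

Selecting columns by name is a design choice of this sketch; the positional selection `c(1,3,5)` used in the loop relies on the alphabetical order of the process names.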


In [7]:
# TODO make nicer and update with whatever is changed above

# compute the mean running time over all (500) families for BigTree and for SM + ML, convert to seconds (/1000) and round to 2 digits
round(colMeans(fin_tracefile_mod_wide[, c(2, 3)])/1000,digits = 2)

# compute the median running time for the BigTree approach and for SM + ML.
round(median(fin_tracefile_mod_wide$Big_Tree)/1000,digits = 2)
round(median(fin_tracefile_mod_wide$Supermatrix_ML)/1000,digits = 2)
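
The same summaries can also be broken down by run group. This is a minimal sketch on invented timings, assuming a table with the `family_group`, `Big_Tree` and `Supermatrix_ML` columns built above; `summarise_seconds` is a hypothetical helper name.

```r
# Hypothetical running times in ms for two run groups.
times <- data.frame(
  family_group   = rep(c("group1", "group2"), each = 3),
  Big_Tree       = c(9000, 12000, 15000, 20000, 26000, 50000),
  Supermatrix_ML = c(2000, 2500, 3000, 4000, 5000, 6000)
)

# Apply a summary function per group, convert ms to seconds, round to 2 digits.
summarise_seconds <- function(fun) {
  res <- aggregate(cbind(Big_Tree, Supermatrix_ML) ~ family_group,
                   data = times, FUN = fun)
  res[, -1] <- round(res[, -1] / 1000, digits = 2)
  res
}

group_means   <- summarise_seconds(mean)
group_medians <- summarise_seconds(median)
print(group_means)
print(group_medians)
```

Reporting the median next to the mean is useful when a few slow families skew the distribution, as in the second toy group.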

<a id='subchapter-7_1-Figure-3'></a>

### Subchapter 7.1 - Figure 3

TODO add chapter link and description if necessary.

In [8]:
# TODO make this into the section of custom functions.
# Function to assign new values based on a mapping provided by the user
assign_mapped_values <- function(df, column_name, mapping, new_col_name) {
  # Check if mapping is a named vector or a list
  if (!is.vector(mapping) || is.null(names(mapping))) {
    stop("Mapping must be a named vector or list.")
  }
  
  # Get the unique values from the specified column
  unique_values = unique(df[[column_name]])
  
  # Check if the mapping contains all unique values
  if (!all(unique_values %in% names(mapping))) {
    stop("Mapping must include all unique values from the specified column.")
  }
  
  # Initialize a vector to store the assigned values
  assigned_values = vector("character", nrow(df))
  
  # Assign values based on the mapping
  for (i in seq_along(unique_values)) {
    assigned_values[df[[column_name]] == unique_values[i]] <- mapping[names(mapping) == unique_values[i]]
  }
  
  # Add the assigned values to the original dataframe
  df[[new_col_name]] = assigned_values
  
  return(df)
}
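
For reference, the same lookup can be written without a loop, because indexing a named vector by name is vectorized in R. The sketch below uses hypothetical run names (`run_a`, `run_b`) and keeps the check that every value has an entry in the mapping.

```r
# Hypothetical data: two run directories mapped to plot labels.
df <- data.frame(family_group = c("run_a", "run_b", "run_a"),
                 stringsAsFactors = FALSE)
mapping <- c(run_a = "group1\nlen = 32", run_b = "group2\nlen = 46")

# Fail early if a value in the column has no entry in the mapping.
missing_keys <- setdiff(unique(df$family_group), names(mapping))
if (length(missing_keys) > 0) {
  stop("Mapping lacks: ", paste(missing_keys, collapse = ", "))
}

# Indexing a named vector by name does the lookup for every row at once.
df$length <- unname(mapping[df$family_group])
print(df)
```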

In [9]:
#
# Cpu time vs Method (SM + ML and Bigtree) divided per group (by sequence length)
#

# add a new column that maps each family_group to a label; in practice this inserts the length of the sequences of each family as an additional column in the dataframe.
# Remember that all families in the same group were generated with the same sequence length; the MSA length, though, might differ due to gaps.
mapping = c("rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_23th_50_fams_25_species" = "group1\nlen = 32",
            "rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_22th_and_23th_50_fams_25_species" = "group2\nlen = 46",
            "rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_20th_22th_and_23th_50_fams_25_species" = "group3\nlen = 50",
            "rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_17th_20th_22th_and_23th_50_fams_25_species" = "group4\nlen = 60",
            "rerun_sim_b_factor_3.5_avg_branchlen_0.7_from_11th_17th_20th_22th_and_23th_50_fams_25_species" = "group5\nlen = 82")
fin_tracefile_mod_reassigned = assign_mapped_values(fin_tracefile_mod_wide, "family_group", mapping, "length")

# reorder the columns to have the numerical values at the end
fin_tracefile_mod_reassigned = fin_tracefile_mod_reassigned[, c("name", "family_group", "length", "Big_Tree", "Supermatrix_ML")]

# reshape the table from one row per family (500 rows) to one row per family and method (1000 rows = 500 BigTree + 500 SM + ML), so that the two running-time columns are stacked into a single column.
# name + family_group work as unique ID: family_group is the directory name of the main.nf run, while name is the name of the family.
fin_tracefile_mod_melted           = melt(fin_tracefile_mod_reassigned)


# convert running times from milliseconds to seconds, and rename the columns.
fin_tracefile_mod_melted$value     = fin_tracefile_mod_melted$value / 1000
colnames(fin_tracefile_mod_melted) = c("name", "family_group", "length", "Method", "Time")

# create a plot with 5 pairs of boxplots (group1, group2, etc.); each group has 100 families and two running time (realtime) measurements per family, one for the BigTree and one for the SM + ML approach.
# the plot is not shown because this file was executed on a cluster and rendering plots creates conflicts with the kernel.
p = ggplot(fin_tracefile_mod_melted, aes(x=length, y=Time, fill=Method)) +
    geom_boxplot(show.legend = TRUE) +
    theme_light() +
    labs(x = "sequence length",
       y = "CPU time (sec)",
       fill = "Method")
ggsave(filename="figures/CPU_time_per_group_BigTree_SM_ML_figure3.png", plot=p, dpi = "retina")

Using name, family_group, length as id variables

[1m[22mSaving 7 x 7 in image

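The wide-to-long step used for the boxplots can be spelled out in base R to make explicit which columns act as IDs and which are stacked. This is a sketch on invented values, not a replacement for `melt()`; in `reshape2::melt()` the `id.vars` and `measure.vars` arguments serve the same purpose and avoid the "Using ... as id variables" message.

```r
# Hypothetical wide table: one row per family, times in ms.
wide <- data.frame(
  name           = c("fam_001", "fam_002"),
  family_group   = c("run_a", "run_a"),
  Big_Tree       = c(9000, 12000),
  Supermatrix_ML = c(2100, 2650)
)
id_cols      <- c("name", "family_group")
measure_cols <- c("Big_Tree", "Supermatrix_ML")

# Repeat the ID columns once per measured column, then stack the measurements.
long <- wide[rep(seq_len(nrow(wide)), times = length(measure_cols)), id_cols]
long$Method <- rep(measure_cols, each = nrow(wide))
long$Time   <- unlist(wide[measure_cols], use.names = FALSE) / 1000  # ms -> s
rownames(long) <- NULL
print(long)
```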

### Supplementary Fig TODO put number of supplementary figure

In [14]:
#
# Cpu time vs Method ratio (Bigtree / SM + ML ) boxplot of each family ratio by group (x axis groups y ratio Big/SM) 
# 

# start from the family and grouping info already built in the Figure 3 cell and compute the ratio Bigtree / SM + ML
# Create a new column with the division of Big_Tree by Supermatrix_ML running times for each family (500)
# and then remove the original columns Big_Tree Supermatrix_ML
fin_tracefile_mod_ratio = fin_tracefile_mod_reassigned %>%
  mutate(fold_ratio = Big_Tree / Supermatrix_ML) %>%
  select(-Big_Tree, -Supermatrix_ML)

# create a plot with 5 boxplots (group1, group2, etc.); each group has 100 families and each family contributes the ratio of its two running time (realtime) measurements, BigTree over SM + ML.
# the plot is not shown because this file was executed on a cluster and rendering plots creates conflicts with the kernel.
p = ggplot(fin_tracefile_mod_ratio, aes(x=length, y=fold_ratio)) +
    geom_boxplot(show.legend = FALSE) +
    theme_light() +
    labs(x = "sequence length",
       y = "CPU time fold ratio (BigTree / Supermatrix_ML)")

# TODO add subplot number to filename
ggsave(filename="figures/CPU_time_per_group_BigTree_SM_ML_ratio.png", plot=p, dpi = "retina")

[1m[22mSaving 7 x 7 in image

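A numeric companion to the ratio boxplot is the median fold ratio per group. The sketch below uses invented timings; a ratio above 1 means that BigTree took longer than SM + ML for that family.

```r
# Hypothetical running times in ms, three families per group.
ratio_df <- data.frame(
  length         = rep(c("group1", "group2"), each = 3),
  Big_Tree       = c(9000, 12000, 15000, 20000, 26000, 50000),
  Supermatrix_ML = c(3000, 3000, 3000, 4000, 6500, 5000)
)

# Per-family fold ratio, then the median within each group.
ratio_df$fold_ratio <- ratio_df$Big_Tree / ratio_df$Supermatrix_ML
median_ratio <- tapply(ratio_df$fold_ratio, ratio_df$length, median)
print(median_ratio)
```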

<a id="chapter-8-Bootstrap-analysis"></a>

## Chapter 8 - Bootstrap analysis

TODO add description
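
As a small illustration of the clade counting used in this chapter, the sketch below runs `prop.clades()` from `ape` on hypothetical 5-tip trees. The tree strings are invented, and the trees are written unrooted (three branches at the base), so no `unroot()` call is needed here.

```r
library(ape)  # read.tree() and prop.clades() come from ape

# Hypothetical 5-tip trees: one estimated tree and three bootstrap replicates.
main_tree  <- read.tree(text = "((a,b),(c,d),e);")
boot_trees <- read.tree(text = c("((a,b),(c,d),e);",
                                 "((a,b),(c,d),e);",
                                 "((a,c),(b,d),e);"))

# Count, for each internal node of main_tree, the replicates containing that clade.
# The first value belongs to the node spanning all tips, so it is dropped.
support_counts <- prop.clades(main_tree, boot_trees)[-1]

# Clades never seen in the replicates may come back as NA; treat them as 0.
support_counts[is.na(support_counts)] <- 0

# Scale counts to proportions of the number of replicates.
support <- support_counts / length(boot_trees)
print(support)
```

In the cell below, the same scaling is done by dividing by `myBS`, the number of bootstrap replicates read from `BS.dat`.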

In [65]:
#
# ME + SM block
#


# Create the dataframe that will contain all bs values and other info like presence in reference tree
BS_df_ME_SM = data.frame(
  group = character(), 
  family = character(), 
  unit = integer(), 
  sample = integer(),
  node = integer(),
  BS = numeric(), 
  ref_presence = numeric()  # holds 0 and 1 values (1 = clade present in the reference tree)
)

for (run in run_dirs) {
    print(basename(run))
    
    # get all the family subdirectories in each main.nf output run dir
    fam_dirs = grep("avg_branchlen_0.7_protein_family*", list.dirs(run, recursive = FALSE), value = TRUE)
    count_fam = 0

    # read the names of the ortholog species.
    orgs = read.table(file = species_name_file)

    # the mapping used later on to rename species trees according to a more generic sequence name
    ref_codefile = read.table(file = reference_tree_codefile )
    
    # work at family level to get the BS value
    for (m in fam_dirs) {
        
        count_fam = count_fam + 1

        # print every 10 families a check on progress to screen
        if (count_fam %% 10 == 0) {
            print(m)
        }

        # read BS.dat, which holds the number of bootstrap replicates; this equals the length (number of columns) of the input MSA of main.nf for that given family.
        bs_file = file.path(m, "results/rerun", "BS.dat")
        bs      = read.table(bs_file)
        myBS    = bs$V1

        # read the reference tree (gene family tree), the one taken as truth and along which the sequences were generated.
        ref_tree_file = file.path(run, paste0(basename(m), ".nwk"))
        ref_tree      = read.tree(ref_tree_file, keep.multi=FALSE)

        # Each gene family has a predefined gene tree (ref_tree) in which a fixed species tree is pasted at each leaf of the paralog tree. The tree is designed like this by choice. Gene families differ only in their paralog tree, which is unique (15 tips).
        # Here we extract the paralog tree; to do so we just need the subtree containing all the sequences of one species. Since every species has the same paralog tree, it does not matter which species we pick.
        ref_subtree   = keep.tip(ref_tree, as.character(ref_codefile$V1))

        # we then rename the tips of the paralog tree (ref_subtree) to a more generic keyname. Effectively going from rat1 to seq1. 
        for (j in 1:length(ref_subtree$tip.label)) { 
            ref_subtree$tip.label[j] = as.character(ref_codefile$V2[which(ref_codefile$V1 == ref_subtree[["tip.label"]][j])])
        }

        # Unroot the tree
        ref_paralog_tree = unroot(ref_subtree)
      
        # do a for loop on the number of species (25) for reading all trees of a given unit (from unit1 to unit25).
        for (k in 1:length(orgs$V1)) {

            # read the file that holds all ME SM trees for a given unit; it holds 10 trees, one for each sample/replicate.
            all_unit_trees_ME_SM_file  = file.path(m, "results/rerun", paste0("unit_", k, "_all.nwk"))
            all_unit_trees_ME_SM       = read.tree(file = all_unit_trees_ME_SM_file, keep.multi = TRUE)
            all_unit_trees_ME_SM       = unroot(all_unit_trees_ME_SM)

            BS_mean_unit_ME_SM = c()
            
            # for loop on the number of trees in all_unit_trees_ME_SM (10). 
            for (i in 1:length(all_unit_trees_ME_SM)) {

                # Each unit_k_sample_i_rep.trees file contains as many trees as there are columns in the gene family MSA. These are all the bootstrap replicate trees for the corresponding ME tree in all_unit_trees_ME_SM.
                # They represent the variability in tree structure across the bootstrap replicates.
                all_bs_rep_tree_ME_SM_file = file.path(m, "results/rerun", paste0("unit_", k, "_sample_", i, "_rep.trees"))
                all_bs_rep_tree_ME_SM      = read.tree(file = all_bs_rep_tree_ME_SM_file, keep.multi = TRUE)
                all_bs_rep_tree_ME_SM      = unroot(all_bs_rep_tree_ME_SM)

                # compute the bootstrap support of each clade by counting how often each clade (node) of the original tree (all_unit_trees_ME_SM[[i]]) is found in the bootstrap replicate trees (all_bs_rep_tree_ME_SM).
                # The [-1] removes the first value, which belongs to the node spanning the whole tree and is not meaningful in unrooted trees.
                # The result is a vector of 12 values: an unrooted binary tree with 15 tips has 13 internal nodes, and dropping the first one leaves 15 - 3 = 12 clades. Every tree has 15 tips because it is a paralog tree, by design choice.
                BS_all_unit_trees_ME_SM         = prop.clades(all_unit_trees_ME_SM[[i]], all_bs_rep_tree_ME_SM)[-1]
                
                # check whether each clade of the unit tree of interest is also present in the paralog tree: 1 if present, NA if missing.
                # this vector and the one above are synchronized, meaning that elements at the same position refer to the same clade
                presence_of_clade_in_ref_ME_SM = prop.clades(all_unit_trees_ME_SM[[i]], ref_paralog_tree)[-1]

                # Any NA values (indicating clades absent from the bootstrap replicates or from the reference tree) are replaced with 0.
                BS_all_unit_trees_ME_SM[is.na(BS_all_unit_trees_ME_SM)]                 = 0
                presence_of_clade_in_ref_ME_SM[is.na(presence_of_clade_in_ref_ME_SM)] = 0

                # Then, the mean of these clade support values is calculated and stored.
                val = mean((as.numeric(BS_all_unit_trees_ME_SM)))
                BS_mean_unit_ME_SM = c(BS_mean_unit_ME_SM, val)

                # Each clade's support value is scaled by dividing by myBS (the total number of bootstrap replicates, i.e. the number of columns of the gene family MSA).
                # add the clade values to the overall dataframe; the ID of each clade is the concatenation of the first 5 columns.
                vector3 = 1:12
                tmp_df  = data.frame(
                      group = rep(basename(run), 12),
                      family = rep(m, 12),
                      unit = rep(k, 12),
                      sample = rep(i, 12),
                      node = vector3,
                      BS = c(as.numeric(BS_all_unit_trees_ME_SM) / myBS), # This scales the values to a proportion (0 to 1) for each clade.
                      ref_presence = presence_of_clade_in_ref_ME_SM
                    )

                # add the new block to the overall dataframe
                BS_df_ME_SM = rbind(BS_df_ME_SM, tmp_df)
            }
        }
    }
}

write.table(BS_df_ME_SM, file="tables/df_bootsrtap_clade_presence_in_ref_ME_SM.tsv", quote=FALSE, sep="\t", row.names=FALSE)

[1] "results"
[1] "../results/avg_branchlen_0.7_protein_family_100.sub"
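
Calling `rbind()` on a growing data frame inside nested loops copies the accumulated table at every iteration, which can become slow when many blocks are added. A common alternative, sketched here with hypothetical loop sizes and placeholder values, is to collect the blocks in a preallocated list and bind them once at the end.

```r
# Hypothetical loop sizes; the real loop runs over runs, families, units and samples.
n_units   <- 3
n_samples <- 2
n_nodes   <- 12

blocks <- vector("list", n_units * n_samples)  # one slot per block
idx <- 0
for (k in seq_len(n_units)) {
  for (i in seq_len(n_samples)) {
    idx <- idx + 1
    blocks[[idx]] <- data.frame(
      unit   = k,
      sample = i,
      node   = seq_len(n_nodes),
      BS     = NA_real_  # placeholder for the scaled support values
    )
  }
}

# Bind all blocks in a single call.
BS_df <- do.call(rbind, blocks)
print(dim(BS_df))
```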


In [None]:
# TODO make this cell into something like the above but for ML

# read the file that holds all ML SM trees for a given unit; it holds 10 trees, one for each sample/replicate.
            all_unit_trees_ML_SM_file = file.path(m, "results/rerun", paste0("unit_", k, "_all_raxml.nwk"))
            all_unit_trees_ML_SM      = read.tree(file = all_unit_trees_ML_SM_file, keep.multi = TRUE)
            all_unit_trees_ML_SM      = unroot(all_unit_trees_ML_SM)

            BS_mean_unit_ML_SM = c()

In [64]:
BS_df_ME_SM

group,family,unit,sample,node,BS,ref_presence
<chr>,<chr>,<int>,<int>,<int>,<dbl>,<dbl>
results,../results/avg_branchlen_0.7_protein_family_001.sub,1,1,1,0.06250,1
results,../results/avg_branchlen_0.7_protein_family_001.sub,1,1,2,0.71875,1
results,../results/avg_branchlen_0.7_protein_family_001.sub,1,1,3,0.15625,1
results,../results/avg_branchlen_0.7_protein_family_001.sub,1,1,4,0.31250,0
results,../results/avg_branchlen_0.7_protein_family_001.sub,1,1,5,0.21875,1
results,../results/avg_branchlen_0.7_protein_family_001.sub,1,1,6,0.28125,0
results,../results/avg_branchlen_0.7_protein_family_001.sub,1,1,7,0.40625,1
results,../results/avg_branchlen_0.7_protein_family_001.sub,1,1,8,0.28125,1
results,../results/avg_branchlen_0.7_protein_family_001.sub,1,1,9,0.37500,1
results,../results/avg_branchlen_0.7_protein_family_001.sub,1,1,10,0.40625,0
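
One way to use this table is to compare the support of clades that are present in the reference tree with the support of clades that are absent. The sketch below only uses the ten rows printed above, so it illustrates the computation and is not a result.

```r
# BS and ref_presence values copied from the ten rows printed above.
bs_rows <- data.frame(
  BS = c(0.06250, 0.71875, 0.15625, 0.31250, 0.21875,
         0.28125, 0.40625, 0.28125, 0.37500, 0.40625),
  ref_presence = c(1, 1, 1, 0, 1, 0, 1, 1, 1, 0)
)

# Mean bootstrap support for clades absent ("0") and present ("1") in the reference.
mean_bs <- tapply(bs_rows$BS, bs_rows$ref_presence, mean)
print(mean_bs)
```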


<br><br><br>
# USEFUL STUFF BUT NOT INCLUDED IN PAPER
<br><br><br>

In [169]:
#
# Cpu time vs Method (SM + ML and Bigtree)   OLD VERSION
#

# reshape the table from one row per family (500 rows) to one row per family and method (1000 rows = 500 BigTree + 500 SM + ML), so that the two running-time columns are stacked into a single column.
# name + family_group work as unique ID: family_group is the directory name of the main.nf run, while name is the name of the family.
fin_tracefile_mod_wide_melted           = melt(fin_tracefile_mod_wide)

# convert running times from milliseconds to seconds, and rename the columns.
fin_tracefile_mod_wide_melted$value     = fin_tracefile_mod_wide_melted$value / 1000
colnames(fin_tracefile_mod_wide_melted) = c("name", "family_group", "Method", "Time")

# create a plot with two boxplots, one per Method (x axis), showing the running times on the y axis.
# the plot is not shown because this file was executed on a cluster and rendering plots creates conflicts with the kernel.
p = ggplot(fin_tracefile_mod_wide_melted,aes(x=Method, y=Time, fill=Method)) +
    geom_boxplot(show.legend = FALSE) +
    theme_light() +
    ylab("CPU time (sec)") + xlab("Method")
ggsave(filename="figures/CPU_time_per_method.png", plot=p, dpi = "retina")

Using name, family_group as id variables

[1m[22mSaving 7 x 7 in image


### OLD version all families together

### TODO make the following better or remove it

In [48]:
# update the trace matrix with the length of the input msa
library("Biostrings") 
fin_tracefile_mod_with_len = fin_tracefile_mod_wide

# for each family, read the input MSA (input of main.nf) to get the length of the MSA (number of columns).
# Iterate over each row of the dataframe
for (i in seq_len(nrow(fin_tracefile_mod_with_len))) {
    # Construct the full file path using group (directory) and name (filename) to read the input msa
    # TODO make it not depend on the rerun_bla_bla directory
    input_msa_path = file.path(paste0(parent_of_all_dir, "rerun_25_species_corrected_branches"), fin_tracefile_mod_with_len$family_group[i], paste0(gsub("[()]", "", fin_tracefile_mod_with_len$name[i]), ".ma"))

    # read the msa in fasta format and get the length of the first sequence (aka length of the alignment)
    msa_file = readAAStringSet(input_msa_path)
    msa_len  = length(msa_file[[1]])
    
    # add the length to the table 
    fin_tracefile_mod_with_len$length[i] = msa_len
}

In [4]:
#
# Cpu time vs Method (SM + ML and Bigtree) ratio plot of each family (x axis number columns BigMSA) 
# 

# Create a new column with the division of Big_Tree by Supermatrix_ML running times for each family (500)
# and then remove the original columns Big_Tree Supermatrix_ML
fin_tracefile_mod_ratio = fin_tracefile_mod_with_len %>%
  mutate(fold_ratio = Big_Tree / Supermatrix_ML) %>%
  select(-Big_Tree, -Supermatrix_ML)

# create a scatterplot with the length of the MSA (number of columns) on the x axis and the fold ratio (BigTree / SM + ML) of running times on the y axis. The plot has 500 points.
# the plot is not shown because this file was executed on a cluster and rendering plots creates conflicts with the kernel.
p = ggplot(fin_tracefile_mod_ratio, aes(x=length, y=fold_ratio)) +
    geom_point(color = 'blue', size = 2, alpha = 0.6) +  # Add points
    theme_light() +
    labs(x = "MSA length (n# columns)",
       y = "fold ratio (BigTree / Supermatrix_ML)")
ggsave(filename="figures/CPU_time_scatterplot_msa_len_BigTree_SM_ML.png", plot=p, dpi = "retina")

ERROR: Error: object 'fin_tracefile_mod_with_len' not found


### TODO the below was just a 10 family trial, make it go away

In [109]:
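# TODO make this into the section of custom functions.
library("stringr")  # provides str_detect() and str_extract() used below
# Function to convert a human-readable duration (ms, or seconds and minutes) to milliseconds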
convert_to_ms <- function(time_str) {
    # Extract components using regular expressions
    millis = "0"
    seconds = "0"
    minutes = "0"
    if (str_detect(time_str, "\\d+(\\.\\d+)?(?=ms)")) {
        millis <- str_extract(time_str, "\\d+(\\.\\d+)?(?=ms)")
    } else {
        seconds <- str_extract(time_str, "\\d+(\\.\\d+)?(?=s)")
        minutes <- str_extract(time_str, "\\d+(\\.\\d+)?(?=m)")
    }
    
    # Convert each component to milliseconds (coercing to numeric where needed)
    seconds_ms <- as.numeric(seconds) * 1000
    minutes_ms <- as.numeric(minutes) * 60 * 1000
    millis_ms <- as.numeric(millis)
    
    # Sum all components and handle NAs by treating them as 0
    total_ms <- sum(c(seconds_ms, minutes_ms, millis_ms), na.rm = TRUE)
    return(total_ms)
}
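
A base-R variant of this conversion is sketched below. It assumes that durations in the trace file look like `345ms`, `2.5s`, `1m 30s` or `1h 2m 3s`; that format, the handling of hours and days, and the function name `parse_duration_ms` are assumptions of this sketch and not taken from the notebook.

```r
# Convert duration strings to milliseconds (vectorized over x).
parse_duration_ms <- function(x) {
  # Milliseconds per unit suffix.
  unit_ms <- c(d = 86400000, h = 3600000, m = 60000, s = 1000, ms = 1)
  vapply(x, function(one) {
    # Extract every "<number><unit>" token, e.g. "1m" and "30s" from "1m 30s".
    tokens <- regmatches(one, gregexpr("[0-9]+(\\.[0-9]+)?(ms|d|h|m|s)", one))[[1]]
    if (length(tokens) == 0) return(NA_real_)
    values <- as.numeric(sub("(ms|d|h|m|s)$", "", tokens))
    units  <- sub("^[0-9.]+", "", tokens)
    sum(values * unit_ms[units])
  }, numeric(1), USE.NAMES = FALSE)
}

durations_ms <- parse_duration_ms(c("345ms", "2.5s", "1m 30s", "1h 2m 3s"))
print(durations_ms)
```

Strings without any recognized token yield `NA`, so unexpected formats become visible as missing values in the table.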

In [120]:
for (run in run_dirs) {
    print(basename(run))

    # each main.nf output run has a trace file reporting various information on the run itself
    tracefile_path = file.path(run, "pipeline_info/execution_trace_2024-09-27_17-39-02.txt")
    tracefile      = read.table(tracefile_path, sep = "\t", header = TRUE)

    # here we parse the trace file to retain only the processes that we want to take the running time (realtime) from
    tracefile_mod = tracefile[grep("simulated_data:run_phylo_ML_supermatrix_aln_sim|simulated_data:only_concatenate_aln_sim|simulated_data:run_phylo_ML_full_aln_sim", tracefile$name), c("name", "realtime")]

    # Use strsplit to split the 'name' column by whitespace
    split_name    = strsplit(as.character(tracefile_mod$name), " ")

    # Create new columns
    tracefile_mod$name       = sapply(split_name, `[`, 2) # second part of the split
    tracefile_mod$native_id  = sapply(split_name, `[`, 1) # first part of the split

    # Reorder columns so 'native_id' becomes the second column
    tracefile_mod = tracefile_mod[, c("name", "native_id", "realtime")]

    # transform from human-readable time to milliseconds
    tracefile_mod$realtime = sapply(tracefile_mod$realtime, convert_to_ms)

    # reshape the table above so that the process names (simulated_data:run_phylo_ML_supermatrix_aln_sim etc.) become the column names;
    # the rows are still the family names (working as ID) and each column holds the running time of that process for the given family.
    tracefile_mod_wide = spread(tracefile_mod, native_id, realtime)
    
    # create a new last column called Supermatrix_ML holding the running time of concatenation plus ML tree computation (on the concatenated MSA)
    tracefile_mod_wide$Supermatrix_ML = as.numeric(tracefile_mod_wide$`simulated_data:only_concatenate_aln_sim`) + as.numeric(tracefile_mod_wide$`simulated_data:run_phylo_ML_supermatrix_aln_sim`)

    # retain only the family name, the BigTree ML running time and the SM + ML running time
    tracefile_mod_wide = tracefile_mod_wide[,c(1,3,5)]

    # rename the second column to Big_Tree and convert its values to numeric
    colnames(tracefile_mod_wide)[2] = "Big_Tree"
    tracefile_mod_wide$Big_Tree = as.numeric(tracefile_mod_wide$Big_Tree)

    # add a column that groups families by the main.nf run they come from
    tracefile_mod_wide$group = rep(basename(run), 10)

    # append the tables together row-wise
    if(run == run_dirs[1]) {
        fin_tracefile_mod_wide = tracefile_mod_wide
    } else {
        fin_tracefile_mod_wide = rbind(fin_tracefile_mod_wide, tracefile_mod_wide)
    }
}

[1] "results"


In [121]:
# compute the mean running time over all families of this trial for BigTree and for SM + ML, convert to seconds (/1000) and round to 2 digits
round(colMeans(fin_tracefile_mod_wide[, c(2, 3)])/1000,digits = 2)

# compute the median running time for the BigTree approach and for SM + ML.
round(median(fin_tracefile_mod_wide$Big_Tree)/1000,digits = 2)
round(median(fin_tracefile_mod_wide$Supermatrix_ML)/1000,digits = 2)

In [122]:

fin_tracefile_mod_wide_melted           = melt(fin_tracefile_mod_wide)

fin_tracefile_mod_wide_melted$value     = fin_tracefile_mod_wide_melted$value / 1000
colnames(fin_tracefile_mod_wide_melted) = c("name","group","Method","Time")

p = ggplot(fin_tracefile_mod_wide_melted,aes(x=Method,y=Time,fill=Method)) +
    geom_boxplot(show.legend = FALSE) +
    theme_light() +
    ylab("CPU time (sec)") + xlab("Method")
ggsave(filename="CPU_time_per_method_10_fam.png", plot=p, dpi = "retina")

Using name, group as id variables

[1m[22mSaving 7 x 7 in image
