# Analysis of mouse empirical dataset

TODO add here a brief description of your project or analysis. Explain the purpose, goals, or context of the notebook here.

## Table of Contents
- [1. Chapter 1 - Introduction](#chapter-1-introduction)
- [2. Chapter 2 - R packages](#chapter-2-R-packages)
- [3. Chapter 3 - Input specification](#chapter-3-Input-specification)
- [4. Chapter 4 - RF distance summaries](#chapter-4-RF-distance-summaries)
- [5. Chapter 5 - Running times](#chapter-5-Running-times)

<a id="chapter-1-introduction"></a>

## Chapter 1 - Introduction

TODO add description of notebook and how to use it.


<a id="chapter-2-R-packages"></a>

## Chapter 2 - R packages

Here is the list of packages needed for this notebook.

In [15]:
library("phangorn")
library("rlist")
library("reshape2")
library("ggplot2")
library("geiger")
library("adephylo")
library("phytools")
library("hrbrthemes")
#library("ggtree")
#library("pheatmap")
library("gplots")
library("ape")
library("plotrix")
library("plyr")
library("dplyr")
library("rstatix")
library("grid")
library("gridExtra")
library("ggforce")
library("viridis")
library("tidyr")
library("stringr")
library("Biostrings")
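
Most of these packages are on CRAN, while Biostrings is distributed through Bioconductor. A minimal sketch for listing which packages are not yet installed before running the cell above; `missing_packages` is a hypothetical helper name and not part of the notebook. Note that `requireNamespace()` loads a namespace when it finds one.

```r
# Hypothetical helper: return the names of the packages that cannot be found.
missing_packages <- function(pkgs) {
    installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!installed]
}

# A few of the packages loaded above; extend the vector as needed.
needed = c("phangorn", "ape", "ggplot2", "dplyr", "stringr", "Biostrings")
print(missing_packages(needed))
```

An empty result means every listed package is available.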

<a id="chapter-3-Input-specification"></a>

## Chapter 3 - Input specification

Change the `parent_of_all_dir` variable to point to the location of the input. All input files are expected to be under it. Otherwise, change each individual input location to your liking.<br>
All input file and directory paths are listed here.

In [2]:
# the location of the directory that contains all inputs for analyses 
parent_of_all_dir = "/users/cn/avignoli/1_paralogs/MOUSE_above_40_entries_clean_run/rerun_with_struct_concat/"

# The paths to all PFAM families directories
pfam_dirs_paths = paste0(parent_of_all_dir, "PF*")

# here we get all PFAM directories paths into an iterable used later on in the analysis
pfam_dirs = Sys.glob(pfam_dirs_paths)
pfam_dirs = pfam_dirs[file.info(pfam_dirs)$isdir]

# each run of main.nf has one output directory.
run_dir = Sys.glob(parent_of_all_dir)
run_dir = run_dir[file.info(run_dir)$isdir]

# reference species keyword.
ref_species = "MOUSE"

# Create the 'figures' directory if it doesn't exist. Created where this notebook is executed.
if (!dir.exists("figures")) {
    dir.create("figures")
}

# Create the 'tables' directory if it doesn't exist. Created where this notebook is executed.
if (!dir.exists("tables")) {
    dir.create("tables")
}
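
Since every later cell assumes a fixed file layout under each family directory, it can help to check for missing inputs up front. The sketch below is one way to do that, assuming the `<family>/<prefix>/<file>` layout used in Chapter 4; `missing_inputs` is a hypothetical helper, and the directory and prefix names in the demonstration are placeholders created in a temporary folder.

```r
# Hypothetical helper: return the expected input files that are missing for one family.
missing_inputs <- function(family_dir, seq_prefix, file_names) {
    paths = file.path(family_dir, seq_prefix, file_names)
    paths[!file.exists(paths)]
}

# Demonstration on a throwaway directory tree (placeholder family and prefix names).
demo_family = file.path(tempdir(), "PF00000_demo")
demo_prefix = "results_demo/tcoffee"
dir.create(file.path(demo_family, demo_prefix), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo_family, demo_prefix, "file.code_name")))

# Only one of the two expected files was created, so the other is reported.
missing = missing_inputs(demo_family, demo_prefix, c("file.code_name", "all_units.nwk"))
print(basename(missing))
```

In the notebook itself the same helper could be applied to each entry of `pfam_dirs` before the main loop starts.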

<a id="chapter-4-RF-distance-summaries"></a>

## Chapter 4 - RF distance summaries

TODO add description 
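
The metric summarised in this chapter is the normalized Robinson-Foulds (RF) distance between a reference paralog tree and each unit tree. As a quick orientation, the sketch below computes it on three toy trees with `RF.dist()` from phangorn, following the same pattern as the cell below: the reference tree goes first and only the first row of the distance matrix is kept. The Newick strings are made up for illustration and are not taken from the dataset.

```r
library(ape)
library(phangorn)

# Toy trees on five tips (illustrative only).
ref_tree  = read.tree(text = "((C1,C2),(C3,(C4,C5)));")
same_tree = read.tree(text = "((C2,C1),((C5,C4),C3));")   # same topology, different writing order
diff_tree = read.tree(text = "((C1,C3),(C2,(C4,C5)));")   # C2 and C3 swapped

# Reference first, then the trees to score; c() on phylo objects gives a multiPhylo.
trees = c(ref_tree, same_tree, diff_tree)
rf    = RF.dist(trees, normalize = TRUE)

# First row without the first element: reference vs. every other tree.
rf_vs_ref = as.matrix(rf)[1, -1]
print(rf_vs_ref)
```

A value of 0 means the two topologies are identical; larger values mean more bipartitions differ, with 1 as the maximum after normalization.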

In [13]:
#
# Compute RF distances between the supermatrix and supertree trees and the BigTree ML tree built without Mouse sequences.
# The supermatrix and supertree also have 6 units because the Mouse MSAs were excluded when generating the units.

# Create the dataframe that will hold the RF values for each tree comparison.
# There are (6) reference trees per family in this case, because there are (7-1) species (Mouse excluded) and each of them might yield a different paralog tree.
# For this reason the unit_sample trees are compared to all (6) reference trees.
# Taken together, all columns (except rf_val) identify a single comparison.
df_all_rf = data.frame(
    method    = c(),
    family    = c(),
    species   = c(),
    unit      = c(),
    sample    = c(),
    rf_val    = c()
)

# create a dataframe with miscellaneous information about the gene family
df_fam_info = data.frame(
    family_name  = c(),
    avg_len_seq  = c(),
    sd_len_seq   = c(),
    num_units    = c(),
    num_paralogs = c(),
    num_col_ML   = c(),
    organism     = c()
)


# the directory prefix where all analyzed files are. This is created by the main.nf pipeline in empirical data mode.
seq_prefix = "results_without_mouse_10_samples/tcoffee"

# iterate over all family specific output directories
for (family in pfam_dirs) {

    print(basename(family))

    # load the BigTree MSA sequence-name mapping scheme, e.g. seq 1 would be genename_BOVIN_1.
    codefile_full_aln_path = file.path(family, seq_prefix, "no_ref_full_aln.code_name")
    codefile_full_aln      = read.table(codefile_full_aln_path)

    # load only the mouse MSA sequence names mapping scheme.
    codefile_path = file.path(family, seq_prefix, "file.code_name")
    codefile      = read.table(codefile_path)

    # get the codename/acronym for the organism/species in the gene family. 
    orgs_path = file.path(family, seq_prefix, paste0(basename(family), ".orthologs_org_ids_to_concatenate"))
    orgs      = read.table(orgs_path)

    # read the paralog gene names 
    paralogs_file_path = file.path(family, paste0(basename(family), ".intersecting_genes"))
    paralogs_file      = read.table(paralogs_file_path)

    # read the fasta file that contains all sequences in the gene family, i.e. all paralogs from the (7) species, reference included.
    gene_fam_fasta_path = file.path(family, seq_prefix, paste0(basename(family), ".domain_sequences_prior_after_intersection.fasta"))
    gene_fam_fasta      = readLines(gene_fam_fasta_path)
    sequences           = gene_fam_fasta[!grepl("^>", gene_fam_fasta)]     # Extract sequences 

    # extract the number of columns of the BigTree ML alignment, since the sub-MSAs fed to SM and ST are extracted from this MSA.
    # This allows SM, ST and BigTree to be compared relative to the MSA matrix size.
    no_ref_bigtree_ML_path = file.path(family, seq_prefix, paste0(basename(family), ".domain_sequences_prior_after_intersection_full.fasta_aln"))
    no_ref_bigtree_ML_aln  = readAAStringSet(no_ref_bigtree_ML_path)
    
    # compute the average/sd sequence length in the family, the number of units (6) and the number of paralogs per species.
    # These values are related: num_species * paralogs_per_species = sequences_in_fam
    avg_len_seq_in_fam = mean(nchar(sequences))
    sd_len_seq_in_fam  = sd(nchar(sequences))
    num_of_units       = length(orgs$V1) - 1
    num_paralogs       = length(paralogs_file$V1)
    num_columns_ML     = length(no_ref_bigtree_ML_aln[[1]])

    # compile the dataframe with family name and organism present
    df_fam_info = data.frame(
        family_name  = c(df_fam_info$family_name, basename(family)),      
        avg_len_seq  = c(df_fam_info$avg_len_seq, avg_len_seq_in_fam),
        sd_len_seq   = c(df_fam_info$sd_len_seq, sd_len_seq_in_fam),
        num_units    = c(df_fam_info$num_units, num_of_units),
        num_paralogs = c(df_fam_info$num_paralogs, num_paralogs),
        num_col_ML   = c(df_fam_info$num_col_ML, num_columns_ML),
        organism     = c(df_fam_info$organism, paste(orgs[[1]], collapse = ", "))     # this is just a string holding all species keywords
    )

    # Get all unit trees (60 = 6 organisms/units (7 - mouse) * 10 samples/statistical repeats) for the combination supermatrix method (SM) + Minimum Evolution (ME).
    # make them binary trees using the multi2di function
    all_units_trees_seq_path = file.path(family, seq_prefix, "all_units.nwk")
    all_units_trees_seq      = read.tree(all_units_trees_seq_path, keep.multi=TRUE)
    all_units_trees_seq      = multi2di(all_units_trees_seq)

    # do the same as above for SM + Maximum Likelihood (ML).
    all_units_raxml_trees_seq_path = file.path(family, seq_prefix, "all_units_RAxML_trees.nwk")
    all_units_raxml_trees_seq      = read.tree(all_units_raxml_trees_seq_path, keep.multi=TRUE)
    all_units_raxml_trees_seq      = multi2di(all_units_raxml_trees_seq)

    # do the same as above for Super Tree (ST) method + ME.
    all_units_superfine_trees_seq_path = file.path(family, seq_prefix, "all_units_superfine_trees.nwk")
    all_units_superfine_trees_seq      = read.tree(all_units_superfine_trees_seq_path, keep.multi=TRUE)
    all_units_superfine_trees_seq      = multi2di(all_units_superfine_trees_seq)

    # do the same as above for ST + ML.
    all_units_raxml_superfine_trees_seq_path = file.path(family, seq_prefix, "all_units_RAxML_superfine_trees.nwk")
    all_units_raxml_superfine_trees_seq      = read.tree(all_units_raxml_superfine_trees_seq_path, keep.multi=TRUE)
    all_units_raxml_superfine_trees_seq      = multi2di(all_units_raxml_superfine_trees_seq)

    # it will store all the species reference paralog trees (6)
    all_species_raxml_subtrees_from_full_aln = c()
    
    # read the BigTree ML 
    species_raxml_tree_from_full_aln_path = file.path(family, seq_prefix, paste0("no_ref_RAxML_bestTree.", basename(family), ".domain_sequences_prior_after_intersection_full_coded_raxml.nwk"))
    species_raxml_tree_from_full_aln = read.tree(species_raxml_tree_from_full_aln_path, keep.multi=FALSE)
    
    # for each tip in the above tree rename according to the mapping coding file for ML bigtree
    for (j in 1:length(species_raxml_tree_from_full_aln$tip.label)) { 
        species_raxml_tree_from_full_aln$tip.label[j] = as.character(codefile_full_aln$V1[which(codefile_full_aln$V2 == species_raxml_tree_from_full_aln[["tip.label"]][j])])
    }

    # get the list of species/organisms that are present in the gene family, excluding the reference species (MOUSE).
    noref_species = orgs$V1[orgs$V1 != ref_species]

    # do a for loop on the species (7 -1) 
    for (species in noref_species) {
    
        # get the list of indices referring to the organism in question
        tips_to_keep = grep(species,species_raxml_tree_from_full_aln$tip.label)
    
        # extract the subtree (paralog tree) from the BigTree ML relating to the species in question
        test_subtree = keep.tip(species_raxml_tree_from_full_aln,tips_to_keep)
    
        # rename each tip, substituting the specific species codename with MOUSE, effectively going from ACAP2_AILME_1 to ACAP2_MOUSE_1
        # this makes the tip names from all species subtrees comparable, although they refer to different sequences.
        # this is fine since only the correctness of the tree topology is tested later on.
        test_subtree$tip.label = gsub(paste0("_", species, "_"), "_MOUSE_", test_subtree$tip.label)
    
        # the tips are further mapped to the C1, C2, C3, etc. nomenclature, going from ACAP2_MOUSE_1 to C2
        for (j in 1:length(test_subtree$tip.label)) { 
            test_subtree$tip.label[j] = as.character(codefile$V2[which(codefile$V1 == test_subtree[["tip.label"]][j])])
        }
        
        # transform the subtree into the ape multiPhylo class and add it to the list of all reference paralog trees.
        test_subtree = as.multiPhylo(test_subtree)
        all_species_raxml_subtrees_from_full_aln = c(all_species_raxml_subtrees_from_full_aln, test_subtree)
    }

    #
    # SM + ME rf computation block 
    # 
    
    # iterate over all possible species reference trees
    for (r in 1:length(all_species_raxml_subtrees_from_full_aln)) {
    
        # RF.dist needs all the trees to compare in a single list (multiPhylo class); it then automatically does all-against-all comparisons.
        # here we compare all (60) unit_sample SM + ME trees of the family with one of the family's (6) species paralog reference trees
        trees_to_compare_ME_SM         = c(all_species_raxml_subtrees_from_full_aln[r], all_units_trees_seq)
        class(trees_to_compare_ME_SM)  = "multiPhylo"
        rf_trees_to_compare_ME_SM      = RF.dist(trees_to_compare_ME_SM, normalize=TRUE)
    
        # transform the RF.dist output to a matrix and take the first row without the first element.
        # Basically, retrieve only the comparisons between the reference tree and each unit tree
        rf_ME_SM_comparisons                = data.matrix(rf_trees_to_compare_ME_SM)[1, -1]
    
        # loop over the resulting vector to update the dataframe correctly.
        # the (60) values in the above vector come from different units and samples, in order; the for loop is just to "count" them correctly.
        # the dataframe is updated in chunks of 10
        for (u in seq(10, 60, by=10)) {
        
            # update the df with all relevant info. Since the rf vector is split into chunks of ten, everything else has to be repeated accordingly.
            df_all_rf = data.frame(
                method    = c(df_all_rf$method, rep("SM_ME", 10)),
                family    = c(df_all_rf$family, rep(basename(family), 10)),           # use family name
                species   = c(df_all_rf$species, rep(noref_species[r] , 10)),         # species the reference tree comes from
                unit      = c(df_all_rf$unit, rep((u/10), 10)),
                sample    = c(df_all_rf$sample, c(1, 2, 3, 4, 5, 6 ,7 ,8, 9, 10)),
                rf_val    = c(df_all_rf$rf_val, rf_ME_SM_comparisons[(u-9):u])        # get all sample from the same unit
            )
        }
    }

    #
    # SM + ML rf computation block 
    # 
    
    # repeat the previous block for ML as well
    for (r in 1:length(all_species_raxml_subtrees_from_full_aln)) {
        trees_to_compare_ML_SM         = c(all_species_raxml_subtrees_from_full_aln[r], all_units_raxml_trees_seq)
        class(trees_to_compare_ML_SM)  = "multiPhylo"
        rf_trees_to_compare_ML_SM      = RF.dist(trees_to_compare_ML_SM, normalize=TRUE)
        rf_ML_SM_comparisons           = data.matrix(rf_trees_to_compare_ML_SM)[1, -1]
        for (u in seq(10, 60, by=10)) {
            df_all_rf = data.frame(
                method    = c(df_all_rf$method, rep("SM_ML", 10)),
                family    = c(df_all_rf$family, rep(basename(family), 10)),           # use family name
                species   = c(df_all_rf$species, rep(noref_species[r] , 10)),         # species the reference tree comes from
                unit      = c(df_all_rf$unit, rep((u/10), 10)),
                sample    = c(df_all_rf$sample, c(1, 2, 3, 4, 5, 6 ,7 ,8, 9, 10)),
                rf_val    = c(df_all_rf$rf_val, rf_ML_SM_comparisons[(u-9):u])        # get all sample from the same unit
            )
        }
    }
    
    #
    # ST + ME rf computation block 
    # 
    
    # repeat the previous block for super tree as well
    for (r in 1:length(all_species_raxml_subtrees_from_full_aln)) {
        trees_to_compare_ME_ST         = c(all_species_raxml_subtrees_from_full_aln[r], all_units_superfine_trees_seq)
        class(trees_to_compare_ME_ST)  = "multiPhylo"
        rf_trees_to_compare_ME_ST      = RF.dist(trees_to_compare_ME_ST, normalize=TRUE)
        rf_ME_ST_comparisons           = data.matrix(rf_trees_to_compare_ME_ST)[1, -1]
        for (u in seq(10, 60, by=10)) {
            df_all_rf = data.frame(
                method    = c(df_all_rf$method, rep("ST_ME", 10)),
                family    = c(df_all_rf$family, rep(basename(family), 10)),           # use family name
                species   = c(df_all_rf$species, rep(noref_species[r] , 10)),         # species the reference tree comes from
                unit      = c(df_all_rf$unit, rep((u/10), 10)),
                sample    = c(df_all_rf$sample, c(1, 2, 3, 4, 5, 6 ,7 ,8, 9, 10)),
                rf_val    = c(df_all_rf$rf_val, rf_ME_ST_comparisons[(u-9):u])        # get all sample from the same unit
            )
        }
    }

    #
    # ST + ML rf computation block 
    # 
    
    # repeat the previous block for ML as well
    for (r in 1:length(all_species_raxml_subtrees_from_full_aln)) {
        trees_to_compare_ML_ST         = c(all_species_raxml_subtrees_from_full_aln[r], all_units_raxml_superfine_trees_seq)
        class(trees_to_compare_ML_ST)  = "multiPhylo"
        rf_trees_to_compare_ML_ST      = RF.dist(trees_to_compare_ML_ST, normalize=TRUE)
        rf_ML_ST_comparisons           = data.matrix(rf_trees_to_compare_ML_ST)[1, -1]
        for (u in seq(10, 60, by=10)) {
            df_all_rf = data.frame(
                method    = c(df_all_rf$method, rep("ST_ML", 10)),
                family    = c(df_all_rf$family, rep(basename(family), 10)),           # use family name
                species   = c(df_all_rf$species, rep(noref_species[r] , 10)),         # species the reference tree comes from
                unit      = c(df_all_rf$unit, rep((u/10), 10)),
                sample    = c(df_all_rf$sample, c(1, 2, 3, 4, 5, 6 ,7 ,8, 9, 10)),
                rf_val    = c(df_all_rf$rf_val, rf_ML_ST_comparisons[(u-9):u])        # get all sample from the same unit
            )
        }
    }
}

# save the dataframes
write.table(df_fam_info, file = "tables/RF_empirical_set_fam_info.tsv", sep="\t", quote = FALSE, row.names = FALSE)
write.table(df_all_rf, file = "tables/RF_empirical_set_between_Bigtree_ML_no_ref_supermatrix_and_supertree.tsv", quote = FALSE, row.names = FALSE)

[1] "PF00004"
[1] "PF00019"
[1] "PF00022"
[1] "PF00025"
[1] "PF00041"
[1] "PF00059"
[1] "PF00061"
[1] "PF00092"
[1] "PF00102"
[1] "PF00104"
[1] "PF00106"
[1] "PF00122"
[1] "PF00134"
[1] "PF00167"
[1] "PF00168"
[1] "PF00178"
[1] "PF00179"
[1] "PF00250"
[1] "PF00307"
[1] "PF00373"
[1] "PF00531"
[1] "PF00566"
[1] "PF00595"
[1] "PF00615"
[1] "PF00620"
[1] "PF00621"
[1] "PF00622"
[1] "PF00640"
[1] "PF00777"
[1] "PF00782"
[1] "PF00787"
[1] "PF00788"
[1] "PF00822"
[1] "PF01392"
[1] "PF01412"
[1] "PF01421"
[1] "PF02214"
[1] "PF02931"
[1] "PF07679"
[1] "PF08205"
[1] "PF12796"
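
The four blocks in the cell above grow `df_all_rf` ten rows at a time. The same unit/sample bookkeeping can be expressed with `rep()`, as sketched below on placeholder numbers (not real RF values); the method, family and species labels are only examples, and the layout of 6 units by 10 samples is taken from the comments in the cell.

```r
# Toy stand-in for one row of RF values: 6 units x 10 samples, ordered unit by unit.
n_units   = 6
n_samples = 10
rf_values = seq_len(n_units * n_samples) / 100   # placeholder numbers, not real RF values

# Build all 60 rows in one call: the unit index changes every 10 values, the sample index cycles 1..10.
rows = data.frame(
    method  = "SM_ME",
    family  = "PF00004",
    species = "BOVIN",
    unit    = rep(seq_len(n_units), each = n_samples),
    sample  = rep(seq_len(n_samples), times = n_units),
    rf_val  = rf_values
)

# Pieces from several iterations can be collected in a list and bound once at the end.
pieces = list(rows, transform(rows, method = "SM_ML"))
df_all = do.call(rbind, pieces)
print(dim(df_all))
```

Binding once at the end avoids rebuilding the whole data frame on every iteration, which matters more as the number of families grows.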


In [28]:
#
# Table 2 SM and ST VS Big-Tree ML no ref. group  RF by method and unit.
#

# the directory prefix where all analyzed files are. This is created by the main.nf pipeline in empirical data mode.
seq_prefix = "results_without_mouse_10_samples/tcoffee"

# define how many units there are 
orgs_path      = file.path(pfam_dirs[[1]], seq_prefix, paste0(basename(pfam_dirs[[1]]), ".orthologs_org_ids_to_concatenate"))
orgs           = read.table(orgs_path)
num_of_species = length(orgs$V1)
num_of_units   = length(orgs$V1) - 1

# read the table compiled in chapter 4. So this cell does not depend on that one to run, it just needs the table file.
df_new = read.table("tables/RF_empirical_set_between_Bigtree_ML_no_ref_supermatrix_and_supertree.tsv", header=T)

# Group data by method and unit, then calculate mean and standard deviation for all families 
summary_table = df_new %>%
  group_by(method, unit) %>%
  summarise(
    mean_rf_val = mean(rf_val, na.rm = TRUE),
    sd_rf_val = sd(rf_val, na.rm = TRUE)
  ) %>%
  ungroup()

# print the unit rf table and save it
summary_table
write.table(summary_table, file = "tables/RF_empirical_set_per_unit_SM_ST_ME_ML_vs_BigTreeML_no_ref.tsv", quote = FALSE, row.names = FALSE)

`summarise()` has grouped output by 'method'. You can override using the
`.groups` argument.


| method | unit | mean_rf_val | sd_rf_val |
|---|---|---|---|
| `<chr>` | `<int>` | `<dbl>` | `<dbl>` |
| SM_ME | 1 | 0.4813789 | 0.2188298 |
| SM_ME | 2 | 0.420916 | 0.2167914 |
| SM_ME | 3 | 0.3859349 | 0.2183725 |
| SM_ME | 4 | 0.3668104 | 0.2227733 |
| SM_ME | 5 | 0.3525344 | 0.2257878 |
| SM_ME | 6 | 0.347701 | 0.2227963 |
| SM_ML | 1 | 0.4606626 | 0.2216494 |
| SM_ML | 2 | 0.3563474 | 0.2267008 |
| SM_ML | 3 | 0.3208072 | 0.2243196 |
| SM_ML | 4 | 0.2748926 | 0.2160326 |
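
The `summarise()` message about grouped output can be avoided by stating the grouping of the result explicitly. A minimal sketch on toy data, assuming dplyr 1.0 or later: with `.groups = "drop"` the result is returned ungrouped, so the trailing `ungroup()` is no longer needed. The values below are invented for illustration.

```r
library(dplyr)

# Toy data with the same columns used for the per-unit summary (made-up values).
toy = data.frame(
    method = rep(c("SM_ME", "SM_ML"), each = 4),
    unit   = rep(c(1, 1, 2, 2), times = 2),
    rf_val = c(0.5, 0.3, 0.4, 0.2, 0.6, 0.2, 0.1, 0.3)
)

# .groups = "drop" returns an ungrouped tibble and silences the message.
toy_summary = toy %>%
    group_by(method, unit) %>%
    summarise(
        mean_rf_val = mean(rf_val, na.rm = TRUE),
        sd_rf_val   = sd(rf_val, na.rm = TRUE),
        .groups     = "drop"
    )

print(toy_summary)
```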


In [23]:
#
# Fig 5a Supermatrix-ME and Supermatrix-ML VS Big-Tree ML without ref (no_ref) (aka MOUSE) paralog trees projections(aka species subtrees extracted).
#

# read the table compiled in chapter 4 cell above. So this cell does not depend on that one to run, it just needs the table file.
df = read.table("tables/RF_empirical_set_per_unit_SM_ST_ME_ML_vs_BigTreeML_no_ref.tsv", header=T)

# retrieve only Supermatrix method rows.
df_figure5a = subset(df, grepl("SM", method))

# Add prefix "unit_" to the "unit" column
df_figure5a$unit = paste0("unit_", df_figure5a$unit)

# set limits for plot
options(repr.plot.width = 5, repr.plot.height = 5, repr.plot.res = 300)

# Define custom colors for each Method
custom_colors = c("SM_ME" = "#2F739B",   # Dark blue
                  "SM_ML" = "#D05159")   # Red

# plot the table 
unit_rf_5a = ggplot(data=df_figure5a, aes(x=unit, y=mean_rf_val, group=method, color=method)) +
    geom_line() + 
    geom_point() + 
    theme_light() + 
    ylab("Normalized RF") + xlab("Topology") +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1, size=8),
        axis.text.y = element_text(size=8),
        axis.title = element_text(face = "bold"),
        legend.position="right"
        ) + 
    scale_y_continuous(limits = c(0, 0.51)) +
    # Custom legend labels and apply custom colors
    scale_color_manual(
        name = "Method", 
        labels = c("SM_ME" = "Supermatrix-ME", "SM_ML" = "Supermatrix-ML"), # Custom legend labels
        values = custom_colors
    )

ggsave(filename="figures/RF_empirical_set_lineplot_SM_vs_ML_Bigtree_noref_figure5a.png", plot=unit_rf_5a, dpi = "retina", width = 5, height = 5)

In [24]:
#
# Fig 5b Supertree-ME and Supertree-ML VS Big-Tree ML without ref (no_ref) (aka MOUSE) paralog trees projections(aka species subtrees extracted).
#

# read the table compiled in chapter 4, two cells above. So this cell does not depend on that one to run, it just needs the table file.
df = read.table("tables/RF_empirical_set_per_unit_SM_ST_ME_ML_vs_BigTreeML_no_ref.tsv", header=T)

# retrieve only Supertree method rows.
df_figure5b = subset(df, grepl("ST", method))

# Add prefix "unit_" to the "unit" column
df_figure5b$unit = paste0("unit_", df_figure5b$unit)

# set limits for plot
options(repr.plot.width = 5, repr.plot.height = 5, repr.plot.res = 300)

# Define custom colors for each Method
custom_colors = c("ST_ME" = "#7CB6E2",     # Light blue 
                  "ST_ML" = "#E5A23D")     # Orange/yellow

# plot the table 
unit_rf_5b = ggplot(data=df_figure5b, aes(x=unit, y=mean_rf_val, group=method, color=method)) +
    geom_line() + 
    geom_point() + 
    theme_light() + 
    ylab("Normalized RF") + xlab("Topology") +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1, size=8),
        axis.text.y = element_text(size=8),
        axis.title = element_text(face = "bold"),
        legend.position="right"
        ) + 
    scale_y_continuous(limits = c(0, 0.51)) +
    # Custom legend labels and apply custom colors
    scale_color_manual(
        name = "Method", 
        labels = c("ST_ME" = "Supertree-ME", "ST_ML" = "Supertree-ML"), # Custom legend labels
        values = custom_colors
    )

ggsave(filename="figures/RF_empirical_set_lineplot_ST_vs_ML_Bigtree_noref_figure5b.png", plot=unit_rf_5b, dpi = "retina", width = 5, height = 5)
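
The two line-plot cells above differ only in the rows they select, the colours and the legend labels. One way to avoid the duplication is a small plotting function, sketched below; `plot_unit_rf` is a hypothetical name, and the toy values are placeholders rather than results from the tables.

```r
library(ggplot2)

# Hypothetical helper: line plot of mean normalized RF per unit, one line per method.
plot_unit_rf <- function(df, colors, labels) {
    ggplot(data = df, aes(x = unit, y = mean_rf_val, group = method, color = method)) +
        geom_line() +
        geom_point() +
        theme_light() +
        ylab("Normalized RF") + xlab("Topology") +
        theme(
            axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
            axis.text.y = element_text(size = 8),
            axis.title  = element_text(face = "bold"),
            legend.position = "right"
        ) +
        scale_y_continuous(limits = c(0, 0.51)) +
        scale_color_manual(name = "Method", labels = labels, values = colors)
}

# Toy input with the columns the helper expects (placeholder values).
toy = data.frame(
    method      = rep(c("SM_ME", "SM_ML"), each = 2),
    unit        = rep(c("unit_1", "unit_2"), times = 2),
    mean_rf_val = c(0.48, 0.42, 0.46, 0.36)
)

p = plot_unit_rf(
    toy,
    colors = c("SM_ME" = "#2F739B", "SM_ML" = "#D05159"),
    labels = c("SM_ME" = "Supermatrix-ME", "SM_ML" = "Supermatrix-ML")
)
```

The returned object can be passed to `ggsave()` exactly as in the cells above.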

### TODO decide which of these goes in Figure 5

In [42]:
#
# SM and ST VS Big-Tree ML no ref. Grouped by method and number of paralogs per species in the gene family.
#

# the directory prefix where all analyzed files are. This is created by the main.nf pipeline in empirical data mode.
seq_prefix = "results_without_mouse_10_samples/tcoffee"

# define how many units there are 
orgs_path      = file.path(pfam_dirs[[1]], seq_prefix, paste0(basename(pfam_dirs[[1]]), ".orthologs_org_ids_to_concatenate"))
orgs           = read.table(orgs_path)
num_of_species = length(orgs$V1)
num_of_units   = length(orgs$V1) - 1

# read the tables compiled in chapter 4. So this cell does not depend on that one to run, it just needs the table file.
df_rf   = read.table("tables/RF_empirical_set_between_Bigtree_ML_no_ref_supermatrix_and_supertree.tsv", header=T)
df_info = read.table("tables/RF_empirical_set_fam_info.tsv", sep="\t", header=T)

# Group data by family and method, then calculate the mean and standard deviation for each combination.
family_table = df_rf %>%
  group_by(family, method) %>%
  summarise(
    mean_rf_val = mean(rf_val, na.rm = TRUE),
    sd_rf_val = sd(rf_val, na.rm = TRUE)
  ) %>%
  ungroup()

# Repeat each value in df_info$num_paralogs 4 times and add it to the family df. 4 = SM/ST x ML/ME
# this works because both dataframes are ordered lexicographically by the PFAM family ID column.
repeated_values           = rep(df_info$num_paralogs, each = 4)
family_table$num_paralogs = repeated_values

# print the family rf table and save it
family_table
write.table(family_table, file = "tables/RF_empirical_set_per_family_SM_ST_ME_ML_vs_BigTreeML_no_ref.tsv", quote = FALSE, row.names = FALSE)

`summarise()` has grouped output by 'family'. You can override using the
`.groups` argument.


| family | method | mean_rf_val | sd_rf_val | num_paralogs |
|---|---|---|---|---|
| `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<int>` |
| PF00004 | SM_ME | 0.6197917 | 0.05018897 | 19 |
| PF00004 | SM_ML | 0.2572917 | 0.20347724 | 19 |
| PF00004 | ST_ME | 0.6270833 | 0.07576789 | 19 |
| PF00004 | ST_ML | 0.4354167 | 0.15071499 | 19 |
| PF00019 | SM_ME | 0.2912281 | 0.08134700 | 22 |
| PF00019 | SM_ML | 0.1649123 | 0.12710014 | 22 |
| PF00019 | ST_ME | 0.3245614 | 0.10099850 | 22 |
| PF00019 | ST_ML | 0.2570175 | 0.10412447 | 22 |
| PF00022 | SM_ME | 0.2400000 | 0.09708854 | 13 |
| PF00022 | SM_ML | 0.2033333 | 0.13799197 | 13 |
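
Attaching `num_paralogs` with `rep(..., each = 4)` relies on both tables listing the families in the same order and on every family having exactly four method rows. A key-based join removes that assumption. The sketch below uses base `merge()` on toy tables whose column names mirror the ones above; the values are rounded placeholders.

```r
# Toy per-family summary (two methods per family for brevity).
family_table = data.frame(
    family      = rep(c("PF00004", "PF00019"), each = 2),
    method      = rep(c("SM_ME", "SM_ML"), times = 2),
    mean_rf_val = c(0.62, 0.26, 0.29, 0.16)
)

# Toy family info, deliberately listed in a different order.
df_info = data.frame(
    family_name  = c("PF00019", "PF00004"),
    num_paralogs = c(22, 19)
)

# Join on the family identifier; all.x = TRUE keeps families that lack an info row.
joined = merge(family_table, df_info, by.x = "family", by.y = "family_name", all.x = TRUE)
print(joined)
```

Because the match is done on the family ID, the result stays correct even if a family is missing from one table or the row order changes.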


In [51]:
#
# Fig 5a Supermatrix-ME and Supermatrix-ML VS Big-Tree ML without ref (no_ref) (aka MOUSE) paralog trees projections(aka species subtrees extracted).
#

# read the table compiled in chapter 4 cell above. So this cell does not depend on that one to run, it just needs the table file.
df = read.table("tables/RF_empirical_set_per_family_SM_ST_ME_ML_vs_BigTreeML_no_ref.tsv", header=T)

# retrieve only Supermatrix method rows.
df_figure5a = subset(df, grepl("SM", method))

# set limits for plot
options(repr.plot.width = 5, repr.plot.height = 5, repr.plot.res = 300)

# Define custom colors for each Method
custom_colors = c("SM_ME" = "#2F739B",   # Dark blue
                  "SM_ML" = "#D05159")   # Red

# plot the table 
unit_rf_5a = ggplot(data=df_figure5a, aes(x=factor(num_paralogs), y=mean_rf_val, fill=method)) +
    geom_boxplot() + 
    theme_light() + 
    ylab("Normalized RF") + xlab("Num. paralogs per species") +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1, size=8),
        axis.text.y = element_text(size=8),
        axis.title = element_text(face = "bold"),
        legend.position="top"
        ) + 
    #scale_y_continuous(limits = c(0, 0.51)) +
    # Custom legend labels and custom colors; the boxplots map `fill`, so the fill scale is used
    scale_fill_manual(
        name = "Method", 
        labels = c("SM_ME" = "Supermatrix-ME", "SM_ML" = "Supermatrix-ML"), # Custom legend labels
        values = custom_colors
    )

ggsave(filename="figures/RF_empirical_set_num_paralogs_boxplot_SM_vs_ML_Bigtree_noref_figure5a.png", plot=unit_rf_5a, dpi = "retina", width = 5, height = 5)


In [52]:
#
# Fig 5b Supertree-ME and Supertree-ML VS Big-Tree ML without ref (no_ref) (aka MOUSE) paralog trees projections(aka species subtrees extracted).
#

# read the table compiled in chapter 4 cell above. So this cell does not depend on that one to run, it just needs the table file.
df = read.table("tables/RF_empirical_set_per_family_SM_ST_ME_ML_vs_BigTreeML_no_ref.tsv", header=T)

# retrieve only Supertree method rows.
df_figure5b = subset(df, grepl("ST", method))

# set limits for plot
options(repr.plot.width = 5, repr.plot.height = 5, repr.plot.res = 300)

# Define custom colors for each Method
custom_colors = c("ST_ME" = "#7CB6E2",     # Light blue 
                  "ST_ML" = "#E5A23D")     # Orange/yellow

# plot the table 
unit_rf_5b = ggplot(data=df_figure5b, aes(x=factor(num_paralogs), y=mean_rf_val, fill=method)) +
    geom_boxplot() + 
    theme_light() + 
    ylab("Normalized RF") + xlab("Num. paralogs per species") +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1, size=8),
        axis.text.y = element_text(size=8),
        axis.title = element_text(face = "bold"),
        legend.position="top"
        ) + 
    #scale_y_continuous(limits = c(0, 0.51)) +
    # Custom legend labels and custom colors; the boxplots map `fill`, so the fill scale is used
    scale_fill_manual(
        name = "Method", 
        labels = c("ST_ME" = "Supertree-ME", "ST_ML" = "Supertree-ML"), # Custom legend labels
        values = custom_colors
    )

ggsave(filename="figures/RF_empirical_set_num_paralogs_boxplot_ST_vs_ML_Bigtree_noref_figure5b.png", plot=unit_rf_5b, dpi = "retina", width = 5, height = 5)
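
In ggplot2 a manual scale only affects the aesthetic it is named after: a `fill` mapping is styled by `scale_fill_manual()`, while `scale_color_manual()` acts on `colour` only. The sketch below shows the pairing for a boxplot on toy data (made-up values) and reads back the fill colours that the layer actually uses with `layer_data()`.

```r
library(ggplot2)

# Toy data: two paralog counts, two methods, three values each (placeholders).
toy = data.frame(
    num_paralogs = rep(c(13, 19), each = 6),
    method       = rep(c("ST_ME", "ST_ML"), times = 6),
    mean_rf_val  = c(0.30, 0.25, 0.35, 0.20, 0.32, 0.22,
                     0.60, 0.45, 0.62, 0.40, 0.58, 0.43)
)

fill_colors = c("ST_ME" = "#7CB6E2", "ST_ML" = "#E5A23D")

# The boxes are mapped to `fill`, so the manual scale must be the fill scale.
p = ggplot(toy, aes(x = factor(num_paralogs), y = mean_rf_val, fill = method)) +
    geom_boxplot() +
    scale_fill_manual(
        name   = "Method",
        labels = c("ST_ME" = "Supertree-ME", "ST_ML" = "Supertree-ML"),
        values = fill_colors
    )

# Inspect the fill colours computed for the boxplot layer.
used_fills = unique(layer_data(p, 1)$fill)
print(used_fills)
```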


<a id="chapter-5-Running-times"></a>

## Chapter 5 - Running times

TODO add description 

In [11]:
# TODO move this into a section of custom functions.
# Convert a human-readable duration from the trace file (e.g. "350ms", "2.5s", "1m 30s") to milliseconds
convert_to_ms <- function(time_str) {
    # Extract each component with a lookahead on its unit suffix; absent components yield NA.
    # The minutes pattern excludes "ms" so that milliseconds are never read as minutes.
    # Hours are handled as well, assuming Nextflow-style durations such as "1h 2m 3s".
    hours   = str_extract(time_str, "\\d+(\\.\\d+)?(?=h)")
    minutes = str_extract(time_str, "\\d+(\\.\\d+)?(?=m(?!s))")
    seconds = str_extract(time_str, "\\d+(\\.\\d+)?(?=s)")
    millis  = str_extract(time_str, "\\d+(\\.\\d+)?(?=ms)")
    
    # Convert each component to milliseconds (coercing to numeric)
    hours_ms   = as.numeric(hours) * 60 * 60 * 1000
    minutes_ms = as.numeric(minutes) * 60 * 1000
    seconds_ms = as.numeric(seconds) * 1000
    millis_ms  = as.numeric(millis)
    
    # Sum all components, treating absent (NA) ones as 0
    total_ms = sum(c(hours_ms, minutes_ms, seconds_ms, millis_ms), na.rm = TRUE)
    return(total_ms)
}
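
As a cross-check of the unit arithmetic, the conversion can also be written as a table-driven sketch in base R. `parse_duration_ms` is a hypothetical name, and the accepted format (number immediately followed by `ms`, `s`, `m`, `h` or `d`, components separated by spaces) is an assumption about how the trace file writes durations.

```r
# Hypothetical base-R variant: tokenize the string and multiply each value by its unit factor.
parse_duration_ms <- function(x) {
    unit_ms = c(ms = 1, s = 1000, m = 60 * 1000, h = 60 * 60 * 1000, d = 24 * 60 * 60 * 1000)

    # Find every "<number><unit>" token; "ms" is listed before "m" and "s".
    tokens = regmatches(x, gregexpr("[0-9]+(\\.[0-9]+)?(ms|s|m|h|d)", x))[[1]]
    if (length(tokens) == 0) return(NA_real_)

    # Split each token into its numeric value and its unit suffix.
    values = as.numeric(sub("[a-z]+$", "", tokens))
    units  = sub("^[0-9.]+", "", tokens)

    sum(values * unit_ms[units])
}

# Worked examples: 1m 30s = 60000 + 30000 ms; 1h 2m 3s = 3600000 + 120000 + 3000 ms.
examples = c("350ms", "2.5s", "1m 30s", "1h 2m 3s")
print(vapply(examples, parse_duration_ms, numeric(1)))
```

Returning `NA` for strings without any recognizable token makes unexpected formats visible instead of silently counting them as zero.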

In [12]:
# each main.nf output run has a trace file reporting various information on the run itself
tracefile_path = file.path(run_dir, "trace_cpu_time.txt")
tracefile      = read.table(tracefile_path, sep = "\t", header = TRUE)

# The MSA computation step is not included in the running time, only the steps needed to go from the MSA to the paralog tree.
# the list of process names to extract from the trace file
process_names = paste(
    "empirical_data:run_phylo_ML_full_aln_emp",                     # BigTree ML with ref (41x)
    "empirical_data:extract_species_submsa_ML_emp",
    "empirical_data:run_phylo_ME_full_aln_emp",                     # BigTree ME with ref (41x)
    "empirical_data:extract_species_submsa_ME_emp",
    "empirical_data:superfine_emp \\(",                             # both BigTree ML and BigTree ME with ref (summation) (82x)
    "empirical_data:run_phylo_ML_full_aln_emp_no_ref",              # BigTree ML without ref (41x)
    "empirical_data:extract_species_submsa_ML_emp_no_ref",
    "empirical_data:run_phylo_ME_full_aln_emp_no_ref",              # BigTree ME without ref (41x)
    "empirical_data:extract_species_submsa_ME_emp_no_ref",
    "empirical_data:superfine_emp_no_ref",                          # both BigTree ML and BigTree ME without ref (summation) (82x)
    "empirical_data:run_phylo_ML_supermatrix_aln_emp",              # supermatrix ML (41x)
    "empirical_data:run_phylo_ME_supermatrix_aln_emp",              # supermatrix ME (41x)
    "empirical_data:only_concatenate_aln_emp",                      # same for supermatrix ML and supermatrix ME (41x)
    "empirical_data:run_phylo_ML_supertree_aln_emp",                # supertree ML (246x)
    "empirical_data:run_phylo_ME_supertree_aln_emp",                # supertree ME (246x)
    "empirical_data:superfine_supertree_emp",                       # both supertree ML and supertree ME (summation) (82x)
    sep="|")

# here we parse the trace file to retain only the processes that we want to take the running time (realtime) from
tracefile_mod = tracefile[grep(process_names, tracefile$name), c("name", "realtime", "status")]

# among these filter out those that have not completed, meaning those processes that finished with an error and were retried.
completed_tracefile = tracefile_mod[tracefile_mod$status %in% c("COMPLETED", "CACHED"), ]

# transform the human-readable running time to milliseconds
completed_tracefile$realtime = sapply(completed_tracefile$realtime, convert_to_ms)

# Use regex to extract PFAM family names
family_names = str_extract(completed_tracefile$name, "PF\\d{5}")

# Get unique family names and sort human-readably (41)
unique_family_names = str_sort(unique(family_names), numeric = TRUE)

# Create a new dataframe with row name the family name and the additional columns filled with zeroes for each method
running_time_df = data.frame(
    family_name       = unique_family_names,
    BigTree_ML        = 0,                        # with the reference species as well (7)
    BigTree_ME        = 0,
    BigTree_ML_no_ref = 0,                        # without reference species (6)
    BigTree_ME_no_ref = 0,
    Supermatrix_ML    = 0,                        # on unit6 (not using ref MSA)
    Supermatrix_ME    = 0,
    Supertree_ML      = 0,                        # on unit6 (not using ref MSA)
    Supertree_ME      = 0,
    row.names         = unique_family_names
)

# all process running times now have to be added to the df above, which has the family ids as rows and the method names as columns.
# the operations differ by method and mirror the nomenclature and logic of the main.nf pipeline, which is why a for loop is used.
for (line_num in seq_len(nrow(completed_tracefile))) {

    # get the process name of the present row
    process_name = str_extract(completed_tracefile[line_num, ]$name, "(?<=:)[^ ]+")

    # get the family name on which the process run
    tmp       = str_extract(completed_tracefile[line_num, ]$name, "\\(([^)]+)\\)")
    family_id = str_remove_all(tmp, "[()]")

    # the branches that handle each process are ordered by how many instances of that process there are.
    if (process_name == "run_phylo_ML_supertree_aln_emp") {
        # handle the case of Supertree - ML
        # first get the process for the creation of the single-species paralog tree

        # add the running time to the appropriate df column.
        # all supertree running times for the same family are summed together (25 values), as if there were no embarrassingly parallel execution.
        running_time_df[family_id, ]$Supertree_ML = running_time_df[family_id, ]$Supertree_ML + completed_tracefile[line_num, ]$realtime
        
    } else if (process_name == "run_phylo_ME_supertree_aln_emp") {
        # handle the case of Supertree - ME
        # first get the process for the creation of the single-species paralog tree, as above
        running_time_df[family_id, ]$Supertree_ME = running_time_df[family_id, ]$Supertree_ME + completed_tracefile[line_num, ]$realtime
        
    } else if (process_name == "superfine_supertree_emp" ) {
        # add the superfine step to both supertree ME and ML 

        # distinguish between ME and ML based on the snippet present in the family name. example: avg_branchlen_0.7_protein_family_001.sub - ME
        method_code = sub(".*\\(.* - (.*)\\)", "\\1", completed_tracefile[line_num, ]$name)

        # create the right column name to update the running_time_df
        method = paste0("Supertree_", method_code)

        # get the family name now
        real_family_name = str_extract(completed_tracefile[line_num, ]$name, "PF\\d{5}")

        # sum this running time to the correct supertree method as well
        running_time_df[real_family_name, ][[method]] = running_time_df[real_family_name, ][[method]] + completed_tracefile[line_num, ]$realtime
        
    } else if(process_name == "run_phylo_ML_supermatrix_aln_emp") {
        # handle supermatrix ML
       
        # add the running time to create the paralog tree
        running_time_df[family_id, ]$Supermatrix_ML = running_time_df[family_id, ]$Supermatrix_ML + completed_tracefile[line_num, ]$realtime
        
    } else if (process_name == "run_phylo_ME_supermatrix_aln_emp") {
        # handle supermatrix ME

        running_time_df[family_id, ]$Supermatrix_ME = running_time_df[family_id, ]$Supermatrix_ME + completed_tracefile[line_num, ]$realtime
    } else if (process_name == "only_concatenate_aln_emp") {
        # add the running time of the step of concatenating submsa into a single MSA creating the "supermatrix" to both supermatrix ME and ML

        running_time_df[family_id, ]$Supermatrix_ML = running_time_df[family_id, ]$Supermatrix_ML + completed_tracefile[line_num, ]$realtime
        running_time_df[family_id, ]$Supermatrix_ME = running_time_df[family_id, ]$Supermatrix_ME + completed_tracefile[line_num, ]$realtime
    
    } else if(process_name == "run_phylo_ML_full_aln_emp" | process_name == "extract_species_submsa_ML_emp") {
        # handle BigTree ML

        # add the running time of both creating the tree and extracting the species paralogs from it
        running_time_df[family_id, ]$BigTree_ML = running_time_df[family_id, ]$BigTree_ML + completed_tracefile[line_num, ]$realtime
        
    } else if(process_name == "run_phylo_ME_full_aln_emp" | process_name == "extract_species_submsa_ME_emp") {
        # handle  Bigtree ME

        running_time_df[family_id, ]$BigTree_ME = running_time_df[family_id, ]$BigTree_ME + completed_tracefile[line_num, ]$realtime
        
    } else if(process_name == "superfine_emp"){
        # add the superfine step to both BigTree ME and ML. this is used to merge together all paralog trees into a single one, like supertree does.

        # distinguish between ME and ML based on the snippet present in the family name. example: avg_branchlen_0.7_protein_family_001.sub - ME
        method_code = sub(".*\\(.* - (.*)\\)", "\\1", completed_tracefile[line_num, ]$name)

        # create the right column name to update the running_time_df
        method = paste0("BigTree_", method_code)

        # get the family name now
        real_family_name = str_extract(completed_tracefile[line_num, ]$name, "PF\\d{5}")

        # sum this running time to the correct BigTree method as well
        running_time_df[real_family_name, ][[method]] = running_time_df[real_family_name, ][[method]] + completed_tracefile[line_num, ]$realtime

    } else if(process_name == "run_phylo_ML_full_aln_emp_no_ref" | process_name == "extract_species_submsa_ML_emp_no_ref") {
        # handle BigTree ML without ref

        # add the running time of both creating the tree and extracting the species paralogs from it
        running_time_df[family_id, ]$BigTree_ML_no_ref = running_time_df[family_id, ]$BigTree_ML_no_ref + completed_tracefile[line_num, ]$realtime
        
    } else if(process_name == "run_phylo_ME_full_aln_emp_no_ref" | process_name == "extract_species_submsa_ME_emp_no_ref") {
        # handle  Bigtree ME without ref

        running_time_df[family_id, ]$BigTree_ME_no_ref = running_time_df[family_id, ]$BigTree_ME_no_ref + completed_tracefile[line_num, ]$realtime
        
    } else if(process_name == "superfine_emp_no_ref"){
        # add the superfine step to both BigTree ME and ML without ref. this is used to merge together all paralog trees into a single one, like supertree does.

        # distinguish between ME and ML based on the snippet present in the family name. example: avg_branchlen_0.7_protein_family_001.sub - ME
        method_code = sub(".*\\(.* - (.*)\\)", "\\1", completed_tracefile[line_num, ]$name)

        # create the right column name to update the running_time_df
        method = paste0("BigTree_", method_code, "_no_ref")

        # get the family name now
        real_family_name = str_extract(completed_tracefile[line_num, ]$name, "PF\\d{5}")

        # sum this running time to the correct BigTree no-ref method as well
        running_time_df[real_family_name, ][[method]] = running_time_df[real_family_name, ][[method]] + completed_tracefile[line_num, ]$realtime
        
    } else {
        print(paste0("WARNING: unrecognized process name: ", process_name))
    }
}

# save the dataframe
write.table(running_time_df, file = "tables/RF_empirical_set_all_methods_running_time_per_family.tsv", sep="\t", quote = FALSE, row.names = FALSE)
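
The loop above relies on a handful of regular expressions to split each trace-file name into a process name, a tag, a method code and a PFAM accession. The sketch below shows what those patterns return on a single name. The example name and its `(PF00001 - ME)` tag format are assumptions made for illustration, and base R `regmatches`/`regexpr` stand in for the `stringr` calls so the snippet runs without extra packages.

```r
# Hypothetical trace-file process name; the "(PF00001 - ME)" tag format is an assumption.
trace_name = "empirical_data:superfine_emp (PF00001 - ME)"

# process name: everything after the ":" up to the first space
process_name = regmatches(trace_name, regexpr("(?<=:)[^ ]+", trace_name, perl = TRUE))

# tag: the text between the parentheses, with the parentheses removed
tag = gsub("[()]", "", regmatches(trace_name, regexpr("\\(([^)]+)\\)", trace_name, perl = TRUE)))

# method code: the text after " - " inside the parentheses
method_code = sub(".*\\(.* - (.*)\\)", "\\1", trace_name)

# PFAM family accession: "PF" followed by five digits
family_name = regmatches(trace_name, regexpr("PF\\d{5}", trace_name, perl = TRUE))

# name of the running_time_df column this row would be added to
method = paste0("BigTree_", method_code)

print(c(process_name, tag, method_code, family_name, method))
```

Running the patterns on a few real rows of `completed_tracefile$name` before the loop is a cheap way to confirm that every process ends up in the intended column.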

In [21]:
#
# Fig 5c all methods ML running time lineplot. for SM and ST it is done on unit 6 with no shuffle. For BigTree it is without ref species (see the filter below).
# x axis is the product of alignment columns, paralogs and units per family, since it is the closest thing that can be compared across all methods.

# read the table compiled in chapter 4. So this cell does not depend on that one to run, it just needs the table file.
df_fam_info = read.table("tables/RF_empirical_set_fam_info.tsv", sep="\t", header=T)

# read the table compiled in chapter 5 one cell above.
df_running_time_ML = read.table("tables/RF_empirical_set_all_methods_running_time_per_family.tsv", sep="\t", header=T)

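# multiply the number of alignment columns, the number of paralogs and the number of units per family, as a proxy for the dimensions of the MSA matrix.
# note: this row-wise product assumes both tables list the families in the same order.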
df_running_time_ML["Product"] = df_fam_info$num_col_ML * df_fam_info$num_paralogs * df_fam_info$num_units

# filter out BigTree with ref species. it would not be a fair comparison as SM and ST are done on (6) units.
df_running_time_ML = df_running_time_ML[, c("family_name", "BigTree_ML_no_ref", "Supermatrix_ML", "Supertree_ML", "Product")]

# Rename the BigTree column and convert running times from milliseconds to seconds.
colnames(df_running_time_ML)[colnames(df_running_time_ML) == "BigTree_ML_no_ref"] = "BigTree_ML"
df_running_time_ML$BigTree_ML     = df_running_time_ML$BigTree_ML / 1000
df_running_time_ML$Supermatrix_ML = df_running_time_ML$Supermatrix_ML / 1000
df_running_time_ML$Supertree_ML   = df_running_time_ML$Supertree_ML / 1000

# Fit linear regression model for BigTree ML
model_Bigtree_ML     = lm(BigTree_ML ~ Product, data = df_running_time_ML)
slope_Bigtree_ML     = coef(model_Bigtree_ML)[2]  # Extract slope
intercept_Bigtree_ML = coef(model_Bigtree_ML)[1]  # Extract intercept

# Fit linear regression model for SM + ML
model_SM_ML     = lm(Supermatrix_ML ~ Product, data = df_running_time_ML)
slope_SM_ML     = coef(model_SM_ML)[2]  # Extract slope
intercept_SM_ML = coef(model_SM_ML)[1]  # Extract intercept

# Fit linear regression model for ST + ML
model_ST_ML     = lm(Supertree_ML ~ Product, data = df_running_time_ML)
slope_ST_ML     = coef(model_ST_ML)[2]  # Extract slope
intercept_ST_ML = coef(model_ST_ML)[1]  # Extract intercept

# print the slopes of the fitted lines. they can be read as the growth factor of the running time.
slope_Bigtree_ML
slope_SM_ML
slope_ST_ML

# melt the table from wide to long: from one row per family (41) to one row per family and method (123 = 41 * 3, there are 3 methods). 
# the method column names become the values of a new column (renamed to Method), and the running times go into a new column (renamed to Time).
df_running_time_ML_metled = melt(df_running_time_ML, id = c("family_name", "Product"))
colnames(df_running_time_ML_metled) = c("name", "Product", "Method", "Time")

# Determine the x-axis range, used as the endpoints of the fitted regression lines on the plot.
x_min = min(df_running_time_ML_metled$Product)
x_max = max(df_running_time_ML_metled$Product)

# plot the table 
unit_rf_5c = ggplot(data=df_running_time_ML_metled, aes(x=Product, y=Time, group=Method, color=Method)) +
    geom_point() +
    geom_segment(x = x_min, 
                 xend = x_max, 
                 y = intercept_Bigtree_ML + slope_Bigtree_ML * x_min, 
                 yend = intercept_Bigtree_ML + slope_Bigtree_ML * x_max, 
                 color = "red", 
                 linetype = "dashed", size = 0.8) +
    geom_segment(x = x_min, 
                 xend = x_max, 
                 y = intercept_SM_ML + slope_SM_ML * x_min, 
                 yend = intercept_SM_ML + slope_SM_ML * x_max, 
                 color = "dark green", 
                 linetype = "dashed", size = 0.8) +
    geom_segment(x = x_min, 
                 xend = x_max, 
                 y = intercept_ST_ML + slope_ST_ML * x_min, 
                 yend = intercept_ST_ML + slope_ST_ML * x_max, 
                 color = "blue", 
                 linetype = "dashed", size = 0.8) +
    theme_light() + 
    ylab("CPU time (sec)") +
    xlab("num. columns * num. sequences") +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1, size=8),
        axis.text.y = element_text(size=8),
        axis.title = element_text(face = "bold"),
        legend.position="right"
        ) 
    #scale_y_continuous(limits = c(0, 0.51)) +

ggsave(filename="figures/CPU_time_per_family_empirical_set_lineplot_ML_SM_ST_Bigtree_noref_figure5c.png", plot=unit_rf_5c, dpi = "retina", width = 5, height = 5)
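
The dashed lines in the figure are drawn from the intercept and slope of each `lm` fit, evaluated at the smallest and largest x value. A minimal sketch of that step on made-up data (the `toy` table and its values are not from the analysis) is:

```r
# Toy data, made up for illustration: running time grows exactly linearly with size
toy = data.frame(Product = c(100, 200, 300, 400))
toy$Time = 2 + 0.5 * toy$Product

# fit the same kind of model as in the cell above
model = lm(Time ~ Product, data = toy)

# extract the coefficients by name rather than by position
intercept = coef(model)[["(Intercept)"]]
slope     = coef(model)[["Product"]]

# endpoints of the fitted line over the observed x range
x_range = range(toy$Product)
y_range = intercept + slope * x_range

print(c(intercept = intercept, slope = slope))
print(y_range)
```

Extracting coefficients by name gives the same values as `coef(model)[1]` and `coef(model)[2]`, and does not depend on the order of terms in the formula.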

In [22]:
#
# Fig 5d all methods ME running time lineplot. for SM and ST it is done on unit 6 with no shuffle. For BigTree it is without ref species (see the filter below).
# x axis is the product of alignment columns, paralogs and units per family, since it is the closest thing that can be compared across all methods.

# read the table compiled in chapter 4. So this cell does not depend on that one to run, it just needs the table file.
df_fam_info = read.table("tables/RF_empirical_set_fam_info.tsv", sep="\t", header=T)

# read the table compiled in chapter 5 one cell above.
df_running_time_ME = read.table("tables/RF_empirical_set_all_methods_running_time_per_family.tsv", sep="\t", header=T)

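# multiply the number of alignment columns, the number of paralogs and the number of units per family, as a proxy for the dimensions of the MSA matrix.
# note: this row-wise product assumes both tables list the families in the same order.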
df_running_time_ME["Product"] = df_fam_info$num_col_ML * df_fam_info$num_paralogs * df_fam_info$num_units

# filter out BigTree with ref species. it would not be a fair comparison as SM and ST are done on (6) units.
df_running_time_ME = df_running_time_ME[, c("family_name", "BigTree_ME_no_ref", "Supermatrix_ME", "Supertree_ME", "Product")]

# Rename the BigTree column and convert running times from milliseconds to seconds.
colnames(df_running_time_ME)[colnames(df_running_time_ME) == "BigTree_ME_no_ref"] = "BigTree_ME"
df_running_time_ME$BigTree_ME     = df_running_time_ME$BigTree_ME / 1000
df_running_time_ME$Supermatrix_ME = df_running_time_ME$Supermatrix_ME / 1000
df_running_time_ME$Supertree_ME   = df_running_time_ME$Supertree_ME / 1000

# Fit linear regression model for BigTree ME
model_Bigtree_ME     = lm(BigTree_ME ~ Product, data = df_running_time_ME)
slope_Bigtree_ME     = coef(model_Bigtree_ME)[2]  # Extract slope
intercept_Bigtree_ME = coef(model_Bigtree_ME)[1]  # Extract intercept

# Fit linear regression model for SM + ME
model_SM_ME     = lm(Supermatrix_ME ~ Product, data = df_running_time_ME)
slope_SM_ME     = coef(model_SM_ME)[2]  # Extract slope
intercept_SM_ME = coef(model_SM_ME)[1]  # Extract intercept

# Fit linear regression model for ST + ME
model_ST_ME     = lm(Supertree_ME ~ Product, data = df_running_time_ME)
slope_ST_ME     = coef(model_ST_ME)[2]  # Extract slope
intercept_ST_ME = coef(model_ST_ME)[1]  # Extract intercept

# print the slopes of the fitted lines. they can be read as the growth factor of the running time.
slope_Bigtree_ME
slope_SM_ME
slope_ST_ME

# melt the table from wide to long: from one row per family (41) to one row per family and method (123 = 41 * 3, there are 3 methods). 
# the method column names become the values of a new column (renamed to Method), and the running times go into a new column (renamed to Time).
df_running_time_ME_metled = melt(df_running_time_ME, id = c("family_name", "Product"))
colnames(df_running_time_ME_metled) = c("name", "Product", "Method", "Time")

# Determine the x-axis range, used as the endpoints of the fitted regression lines on the plot.
x_min = min(df_running_time_ME_metled$Product)
x_max = max(df_running_time_ME_metled$Product)

# plot the table 
unit_rf_5d = ggplot(data=df_running_time_ME_metled, aes(x=Product, y=Time, group=Method, color=Method)) +
    geom_point() + 
    geom_segment(x = x_min, 
                 xend = x_max, 
                 y = intercept_Bigtree_ME + slope_Bigtree_ME * x_min, 
                 yend = intercept_Bigtree_ME + slope_Bigtree_ME * x_max, 
                 color = "red", 
                 linetype = "dashed", size = 0.8) +
    geom_segment(x = x_min, 
                 xend = x_max, 
                 y = intercept_SM_ME + slope_SM_ME * x_min, 
                 yend = intercept_SM_ME + slope_SM_ME * x_max, 
                 color = "dark green", 
                 linetype = "dashed", size = 0.8) +
    geom_segment(x = x_min, 
                 xend = x_max, 
                 y = intercept_ST_ME + slope_ST_ME * x_min, 
                 yend = intercept_ST_ME + slope_ST_ME * x_max, 
                 color = "blue", 
                 linetype = "dashed", size = 0.8) +
    theme_light() + 
    ylab("CPU time (sec)") +
    xlab("num. columns * num. sequences") +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1, size=8),
        axis.text.y = element_text(size=8),
        axis.title = element_text(face = "bold"),
        legend.position="right"
        ) 
    #scale_y_continuous(limits = c(0, 0.51)) +

ggsave(filename="figures/CPU_time_per_family_empirical_set_lineplot_ME_SM_ST_Bigtree_noref_figure5d.png", plot=unit_rf_5d, dpi = "retina", width = 5, height = 5)

<br><br><br>
# USEFUL STUFF BUT NOT INCLUDED IN PAPER
<br><br><br>

In [None]:
#
# Compute RF distances between supermatrix and supertree trees and ML trees from the full alignment of all the seqs
# without Mouse MSAs 

# create the dataframe that will hold RF values for each tree comparison. 
# there are (7) reference trees for a family in this case, because there are 7 species and each one of them might yield a different paralog tree.
# for this reason the unit_sample trees are compared to all (7) reference trees.  
# all columns (except the rf_val column) together work as an identifier for that comparison.
df_all_rf = data.frame(
    method    = c(),
    family    = c(),
    species   = c(),
    unit      = c(),
    sample    = c(),
    rf_val    = c()
)

# create a dataframe with miscellaneous information about the gene family
df_fam_info = data.frame(
    family_name  = c(),
    num_seq      = c(),
    num_paralogs = c(),
    organism     = c()
)


# the directory prefix where all analyzed files are. this is created by the main.nf pipeline in empirical data mode. 
seq_prefix = "results_without_mouse_10_samples/tcoffee/"

# iterate over all family specific output directories
for (family in pfam_dirs) {

    print(basename(family))

    # load the BigTree MSA sequence names mapping scheme, aka tip 1 is pig_genename.
    codefile_full_aln_path = file.path(family, seq_prefix, "full_aln.code_name")
    codefile_full_aln      = read.table(codefile_full_aln_path)

    # load only the mouse MSA sequence names mapping scheme.
    codefile_path = file.path(family, seq_prefix, "file.code_name")
    codefile      = read.table(codefile_path)

    # get the codename/acronym for the organism/species in the gene family. 
    orgs_path = file.path(family, seq_prefix, paste0(basename(family), ".orthologs_org_ids_to_concatenate"))
    orgs      = read.table(orgs_path)

    # read the paralog gene names 
    paralogs_file_path = file.path(family, paste0(basename(family), ".intersecting_genes"))
    paralogs_file      = read.table(paralogs_file_path)

    # calculate the number of sequences in the whole family, the number of species (7) and the number of paralogs per species. 
    # These three values are related: num_species * paralogs_per_species = sequences_in_fam 
    num_seq_in_fam = length(codefile_full_aln[[1]])
    num_of_units   = length(orgs$V1) - 1
    num_paralogs   = length(paralogs_file$V1)

    # compile the dataframe with family name and organism present
    df_fam_info = data.frame(
        family_name  = c(df_fam_info$family_name, rep(basename(family), times=length(orgs[[1]]))),       # repeat to match the size of organism names
        num_seq      = c(df_fam_info$num_seq, rep(num_seq_in_fam, times=length(orgs[[1]]))),
        num_paralogs = c(df_fam_info$num_paralogs, rep(num_paralogs, times=length(orgs[[1]]))),
        organism     = c(df_fam_info$organism, orgs[[1]])
    )

    # Get all units trees (60 = 6 organisms/units (7 - mouse) * 10 samples/statistical repeats) for the combination Minimum Evolution (ME) + supermatrix method (SM).
    # make them binary trees using the multi2di function from ape
    all_units_trees_seq_path = file.path(family, seq_prefix, "all_units.nwk")
    all_units_trees_seq      = read.tree(all_units_trees_seq_path, keep.multi=TRUE)
    all_units_trees_seq      = multi2di(all_units_trees_seq)

    # do the same as above for Maximum Likelihood (ML) + SM 
    all_units_raxml_trees_seq_path = file.path(family, seq_prefix, "all_units_RAxML_trees.nwk")
    all_units_raxml_trees_seq      = read.tree(all_units_raxml_trees_seq_path, keep.multi=TRUE)
    all_units_raxml_trees_seq      = multi2di(all_units_raxml_trees_seq)

    # do the same as above for ME + Super Tree (ST) method
    all_units_superfine_trees_seq_path = file.path(family, seq_prefix, "all_units_superfine_trees.nwk")
    all_units_superfine_trees_seq      = read.tree(all_units_superfine_trees_seq_path, keep.multi=TRUE)
    all_units_superfine_trees_seq      = multi2di(all_units_superfine_trees_seq)

    # do the same as above for ML + ST
    all_units_raxml_superfine_trees_seq_path = file.path(family, seq_prefix, "all_units_RAxML_superfine_trees.nwk")
    all_units_raxml_superfine_trees_seq      = read.tree(all_units_raxml_superfine_trees_seq_path, keep.multi=TRUE)
    all_units_raxml_superfine_trees_seq      = multi2di(all_units_raxml_superfine_trees_seq)

    # it will store all the species reference paralog trees (7)
    all_species_raxml_subtrees_from_full_aln = c()
    
    # read the BigTree ML 
    species_raxml_tree_from_full_aln_path = file.path(family, seq_prefix, paste0("RAxML_bestTree.", basename(family), ".domain_sequences_prior_after_intersection_full_coded_raxml.nwk"))
    species_raxml_tree_from_full_aln = read.tree(species_raxml_tree_from_full_aln_path, keep.multi=FALSE)
    
    # rename each tip in the above tree according to the mapping coding file for ML BigTree
    for (j in seq_along(species_raxml_tree_from_full_aln$tip.label)) { 
        species_raxml_tree_from_full_aln$tip.label[j] = as.character(codefile_full_aln$V1[which(codefile_full_aln$V2 == species_raxml_tree_from_full_aln[["tip.label"]][j])])
    }
    
    # do a for loop on the species (7) 
    for (species in orgs$V1) {
    
        # get the list of indices referring to the organism in question
        tips_to_keep = grep(species,species_raxml_tree_from_full_aln$tip.label)
    
        # extract the subtree (paralog tree) from the BigTree ML relating to the species in question
        test_subtree = keep.tip(species_raxml_tree_from_full_aln,tips_to_keep)
    
        # rename each tip substituting the specific species codename with MOUSE, effectively going from ACAP2_AILME_1 to ACAP2_MOUSE_1
        # this makes all tip names from all species subtrees comparable although they refer to different sequences.
        # this is fine since only the correctness of the topology of the tree is tested later on. 
        test_subtree$tip.label = gsub(paste0("_", species, "_"), "_MOUSE_", test_subtree$tip.label)
    
        # the tips are further mapped to the C1, C2, C3, etc. nomenclature, going from ACAP2_MOUSE_1 to C2
        for (j in seq_along(test_subtree$tip.label)) { 
            test_subtree$tip.label[j] = as.character(codefile$V2[which(codefile$V1 == test_subtree[["tip.label"]][j])])
        }
        
        # transform the subtree into the ape multiPhylo class and add it to the list with all reference paralog trees.
        test_subtree = as.multiPhylo(test_subtree)
        all_species_raxml_subtrees_from_full_aln = c(all_species_raxml_subtrees_from_full_aln, test_subtree)
    }

    #
    # SM + ME rf computation block 
    # 
    
    # iterate over all possible species reference trees
    for (r in 1:length(all_species_raxml_subtrees_from_full_aln)) {
    
        # RF.dist needs all the trees to compare in a single list (class) and then automatically does all-against-all comparisons.
        # here we are comparing all unit_sample (60) family SM + ME trees with the family (7) species paralog reference trees
        trees_to_compare_ME_SM         = c(all_species_raxml_subtrees_from_full_aln[r], all_units_trees_seq)
        class(trees_to_compare_ME_SM)  = "multiPhylo"
        rf_trees_to_compare_ME_SM      = RF.dist(trees_to_compare_ME_SM, normalize=TRUE)
    
        # transform the RF.dist output format to a matrix and take the first row without the first element. 
        # Basically retrieve only the comparisons between reference tree and each unit tree
        rf_ME_SM_comparisons                = data.matrix(rf_trees_to_compare_ME_SM)[1, -1]
    
        # for loop over the created vector to correctly update the dataframe. 
        # the (60) values in the above vector are from different units and samples in an ordered manner; the for loop is just to "count" correctly.
        # the dataframe will be updated in chunks of 10 
        for (u in seq(10, 60, by=10)) {
        
            # update the df with all relevant info. since the rf vector is chunked in pieces of ten, all other things have to be repeated accordingly.
            df_all_rf = data.frame(
                method    = c(df_all_rf$method, rep("SM_ME", 10)),
                family    = c(df_all_rf$family, rep(basename(family), 10)),           # use family name
                species   = c(df_all_rf$species, rep(orgs$V1[r] , 10)),               # for the reference tree get the species which it is from
                unit      = c(df_all_rf$unit, rep((u/10), 10)),
                sample    = c(df_all_rf$sample, 1:10),
                rf_val    = c(df_all_rf$rf_val, rf_ME_SM_comparisons[(u-9):u])        # get all samples from the same unit
            )
        }
    }
    
    #
    # ST + ME rf computation block 
    # 
    
    # repeat the previous block for supertree as well
    for (r in 1:length(all_species_raxml_subtrees_from_full_aln)) {
        trees_to_compare_ME_ST         = c(all_species_raxml_subtrees_from_full_aln[r], all_units_superfine_trees_seq)
        class(trees_to_compare_ME_ST)  = "multiPhylo"
        rf_trees_to_compare_ME_ST      = RF.dist(trees_to_compare_ME_ST, normalize=TRUE)
        rf_ME_ST_comparisons           = data.matrix(rf_trees_to_compare_ME_ST)[1, -1]
        for (u in seq(10, 60, by=10)) {
            df_all_rf = data.frame(
                method    = c(df_all_rf$method, rep("ST_ME", 10)),
                family    = c(df_all_rf$family, rep(basename(family), 10)),           # use family name
                species   = c(df_all_rf$species, rep(orgs$V1[r] , 10)),               # for the reference tree get the species which it is from
                unit      = c(df_all_rf$unit, rep((u/10), 10)),
                sample    = c(df_all_rf$sample, 1:10),
                rf_val    = c(df_all_rf$rf_val, rf_ME_ST_comparisons[(u-9):u])        # get all samples from the same unit
            )
        }
    }
}

# save the dataframe
write.table(df_all_rf, file = "tables/RF_empirical_set_between_ref_supermatrix_and_supertree.tsv", quote = FALSE, row.names = FALSE)
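
The cell above labels the 60 RF values of each reference tree in chunks of ten (one chunk per unit, ten samples each) and rebuilds the data frame on every iteration. As a sketch of an alternative, not the approach used above, the same unit and sample labels can be produced in one step with `rep()`, and the per-reference chunks bound together once. All values and the names `PF00001`, `AILME` and `BOVIN` below are hypothetical stand-ins.

```r
# Toy stand-in for the 60 RF values of one reference tree (6 units x 10 samples, ordered by unit)
rf_values = seq_len(60) / 100

n_units   = 6
n_samples = 10

# one chunk holds all comparisons against a single reference tree
chunk = data.frame(
    method  = "SM_ME",
    family  = "PF00001",                                   # hypothetical family
    species = "AILME",                                     # hypothetical reference species
    unit    = rep(seq_len(n_units), each = n_samples),     # 1,1,...,1,2,2,...,6
    sample  = rep(seq_len(n_samples), times = n_units),    # 1,2,...,10,1,2,...,10
    rf_val  = rf_values
)

# a second chunk for another (hypothetical) reference species
chunk2 = chunk
chunk2$species = "BOVIN"

# collect the chunks in a list and bind them once at the end
df_sketch = do.call(rbind, list(chunk, chunk2))

print(head(df_sketch[df_sketch$unit == 2, ], 3))
```

Building each chunk with columns of equal length avoids the silent recycling that `data.frame()` applies when one column is shorter than the others.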

In [None]:
#
# Table 2 Supermatrix-ME and Supertree-ME VS Big-Tree ML no ref.
#

# define how many units there are (this relies on `family` and `seq_prefix` as left by the loop in the cell above)
orgs_path      = file.path(family, seq_prefix, paste0(basename(family), ".orthologs_org_ids_to_concatenate"))
orgs           = read.table(orgs_path)
num_of_species = length(orgs$V1)
num_of_units   = length(orgs$V1) - 1

# read the table compiled in chapter 4. So this cell does not depend on that one to run, it just needs the table file.
df_new = read.table("tables/RF_empirical_set_between_Bigtree_ML_no_ref_supermatrix_and_supertree.tsv", header=T)

# Group data by method and unit, then calculate mean and standard deviation for all families 
summary_table = df_new %>%
  group_by(method, unit) %>%
  summarise(
    mean_rf_val = mean(rf_val, na.rm = TRUE),
    sd_rf_val = sd(rf_val, na.rm = TRUE)
  ) %>%
  ungroup()

# print the unit rf table and save it
summary_table
write.table(summary_table, file = "tables/RF_empirical_set_per_unit_SM_ST_ME.tsv", quote = FALSE, row.names = FALSE)
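
The summary above groups the RF table by method and unit and reports the mean and standard deviation per group. As a cross-check that needs no extra packages, the same statistics can be sketched in base R with `aggregate()` and `merge()`. The `toy_rf` table and its values are made up for illustration.

```r
# Toy RF table, values made up for illustration
toy_rf = data.frame(
    method = c("SM_ME", "SM_ME", "ST_ME", "ST_ME"),
    unit   = c(1, 1, 1, 1),
    rf_val = c(0.2, 0.4, 0.1, 0.5)
)

# mean and standard deviation of rf_val for every method/unit combination
toy_mean = aggregate(rf_val ~ method + unit, data = toy_rf, FUN = mean)
toy_sd   = aggregate(rf_val ~ method + unit, data = toy_rf, FUN = sd)

# join the two statistics into one table, one row per method/unit
toy_summary = merge(toy_mean, toy_sd, by = c("method", "unit"), suffixes = c("_mean", "_sd"))

print(toy_summary)
```

Note that the formula interface of `aggregate()` drops rows with missing values by default, which plays the role of `na.rm = TRUE` in the `dplyr` version.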

In [None]:
#
# Fig 4 Supermatrix-ME and Supertree-ME VS Big-Tree ML
#

# read the table compiled in chapter 4 cell above. So this cell does not depend on that one to run, it just needs the table file.
df_figure4 = read.table("tables/RF_empirical_set_per_unit_SM_ST_ME.tsv", header=T)

# Add prefix "unit_" to the "unit" column
df_figure4$unit = paste0("unit_", df_figure4$unit)

# set limits for plot
options(repr.plot.width = 5, repr.plot.height = 5, repr.plot.res = 300)

# plot the table 
unit_rf = ggplot(data=df_figure4, aes(x=unit, y=mean_rf_val, group=method, color=method)) +
    geom_line() + 
    geom_point() + 
    theme_light() + 
    ylab("Normalized RF") + xlab("Topology") +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1, size=8),
        axis.text.y = element_text(size=8),
        axis.title = element_text(face = "bold"),
        legend.position="right"
        ) + 
    scale_y_continuous(limits = c(0, 0.51)) +
    # Custom legend labels while keeping default colors
    scale_color_discrete(
        name = "Method", 
        labels = c("ST_ME" = "Supertree-ME", "SM_ME" = "Supermatrix-ME") # Custom legend labels
    ) 


ggsave(filename="figures/RF_lineplot_ME_seq_vs_ML_Big_tree.png", plot=unit_rf, dpi = "retina", width = 5, height = 5)