# Analysis of simulated data


This is a brief description of your project or analysis. Explain the purpose, goals, or context of the notebook here.

## Table of Contents
- [1. Chapter 1 - Introduction](#chapter-1---introduction)
- [2. Chapter 2 - R packages](#chapter-2---R-packages)
- [3. Chapter 3 - Input specification](#chapter-3---Input-specification)
- [4. Chapter 4 - Supermatrix and Supertree trees VS true topologies](#chapter-4---Supermatrix-and-Supertree-trees-VS-true-topologies)
- [5. Chapter 5 - Model Building](#chapter-5---model-building)
- [6. Chapter 6 - Conclusion](#chapter-6---conclusion)


## Chapter 1 - Introduction
<a id="chapter-1---introduction"></a>

Here, you provide an introduction to the chapter. Explain key concepts, background, or objectives, along with anything else that is useful to know about this script.


## Chapter 2 - R packages
<a id="chapter-2---R-packages"></a>

Here is the list of packages needed for this notebook.

In [1]:
library("phangorn")
library("ggplot2")
library("reshape2")
library("geiger")
library("adephylo")
library("phytools")
library("rlist")
library("plotrix")
library("plyr")
library("dplyr")
library("hrbrthemes")
library("rstatix")
library("gplots")
library("viridis")
library('gtools')
library(gridExtra)
library(grid)
library("ggforce")
library("scales")

Loading required package: ape

Loading required package: phytools

Loading required package: maps

Loading required package: ade4


Attaching package: ‘plotrix’


The following object is masked from ‘package:phytools’:

    rescale



Attaching package: ‘plyr’


The following object is masked from ‘package:maps’:

    ozone



Attaching package: ‘dplyr’


The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize


The following object is masked from ‘package:ape’:

    where


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘rstatix’


The following objects are masked from ‘package:plyr’:

    desc, mutate


The following object is masked from ‘package:stats’:

    filter



Attaching package: ‘gplots’


The following object is masked from ‘package:plotrix’:

    plotC

## Chapter 3 - Input specification
<a id="chapter-3---Input-specification"></a>

Change the `parent_of_all_dir` variable to point to the location of the input. All input files are expected to be under it. Otherwise, change the individual input locations to your liking.<br>
All input file and directory paths are listed here.

In [2]:
# the location of the directory that contains all inputs for analyses 
parent_of_all_dir = "/users/cn/abaltzis/projects/1_paralogs/simulated_datasets/simulated_sequences/"

# The directory (or directories) containing all the results of the main.nf pipeline run. It can be a glob path.
dir_containing_all_pipe_results = "../results"
#dir_containing_all_pipe_results = paste0(parent_of_all_dir, "rerun_25_species_corrected_branches/rerun_sim_b_factor*")

# the code file that maps the reference tree tip labels (column V1, e.g. rat1) to generic names (column V2, e.g. seq1)
reference_tree_codefile = file.path(parent_of_all_dir, "ref_tree.code")

# get the species/organism list of names used by the main.nf pipeline
species_name_file = file.path(parent_of_all_dir, "orthologs_org_ids_to_concatenate")
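
Because every later cell reads from these paths, it can help to confirm that they exist before starting. The following is a minimal sketch and not part of the original notebook: `check_inputs()` is a hypothetical helper, and the demo uses a temporary directory as a stand-in for the real input location.

```r
# Hypothetical helper (not in the original notebook): report which of the
# expected input paths are missing.
check_inputs = function(paths) {
    missing = paths[!file.exists(paths)]
    if (length(missing) > 0) {
        warning("missing input(s): ", paste(missing, collapse = ", "))
    }
    # TRUE for every path that exists, named by path
    setNames(file.exists(paths), paths)
}

# Demo on a throwaway directory standing in for parent_of_all_dir
demo_dir = file.path(tempdir(), "demo_inputs")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "ref_tree.code"))

demo_paths = c(
    file.path(demo_dir, "ref_tree.code"),
    file.path(demo_dir, "orthologs_org_ids_to_concatenate")
)
input_status = suppressWarnings(check_inputs(demo_paths))
print(input_status)
```

In the notebook itself the same call could be made on `c(reference_tree_codefile, species_name_file)`.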

## Chapter 4 - Supermatrix and Supertree trees VS true topologies
<a id="chapter-4---Supermatrix-and-Supertree-trees-VS-true-topologies"></a>

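This chapter compares the trees reconstructed with the supermatrix (SM) and supertree (ST) methods, under Minimum Evolution (ME) and Maximum Likelihood (ML), against the true (reference) paralog topologies, using the normalized Robinson-Foulds (RF) distance.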

In [9]:
RF_avg_list = list()
RF_se_list = list()
RF_avg_norm_list = list()
collect_branch_len_supermatrix_ml = c()
collect_diameter_supermatrix_ml = c()

# Each run of main.nf has one output directory. Here we collect that directory, or the output directories of multiple runs.
run_dirs = Sys.glob(dir_containing_all_pipe_results)
run_dirs = run_dirs[file.info(run_dirs)$isdir]

for (run in run_dirs) {
    print(basename(run))

    ref_codefile = read.table(file = reference_tree_codefile )
    
    # get all the family subdirectories in each main.nf output run dir
    fam_dirs = grep("avg_branchlen", list.dirs(run, recursive = FALSE), value = TRUE)
    count_fam = 0

    # this will contain the mean branch length for all true/reference paralog trees for each gene family
    avg_branch_length = c()

    # read the names of the ortholog species.
    orgs = read.table(file = species_name_file)

    # get all units trees for all combinations of methods
    for (m in fam_dirs) {
        print(basename(m))
        count_fam = count_fam + 1

        # Get all units trees (250 = 25 organisms * 10 samples/statistical repeats) for the combination Minimum Evolution (ME) + supermatrix method (SM).
        # We unroot every tree and fully resolve it so that it can be passed to RF.dist and the other functions used below.
        all_units_file  = file.path(m, "results/rerun", "all_units.nwk")
        all_units_trees = read.tree(all_units_file, keep.multi=TRUE)
        all_units_trees = unroot(all_units_trees)
        
        # check whether there are any non-binary trees (consensus trees) among the 250 trees in the file above
        if (length(which(is.binary(all_units_trees) == FALSE)) != 0) {
            
            # if there are non-binary trees, iterate through them and resolve each into a binary tree with multi2di
            for (tree in which(is.binary(all_units_trees) == FALSE)) {
                all_units_trees[[tree]] = multi2di(all_units_trees[[tree]])
            }
        }

        # Get all units trees (250) ME + supertree (ST) method and handle non-binary trees as above
        all_superfine_file  = file.path(m, "results/rerun", "all_units_superfine_trees.nwk")
        all_superfine_trees = read.tree(all_superfine_file, keep.multi=TRUE)
        all_superfine_trees = unroot(all_superfine_trees)
        if (length(which(is.binary(all_superfine_trees) == FALSE)) != 0) {
            for (tree in which(is.binary(all_superfine_trees) == FALSE)) {
                all_superfine_trees[[tree]] = multi2di(all_superfine_trees[[tree]])
            }
        }

        # Get all units trees (250) Maximum Likelihood (ML) + SM 
        all_units_raxml_file  = file.path(m, "results/rerun", "all_units_RAxML_trees.nwk")
        all_units_raxml_trees = read.tree(all_units_raxml_file, keep.multi=TRUE)
        all_units_raxml_trees = unroot(all_units_raxml_trees)
        if (length(which(is.binary(all_units_raxml_trees) == FALSE)) != 0) {
            for (tree in which(is.binary(all_units_raxml_trees) == FALSE)) {
                all_units_raxml_trees[[tree]] = multi2di(all_units_raxml_trees[[tree]])
            }
        }

        # Take only the last 10 of the 250 trees above: the 10 samples/replicates of unit25 (all 25 single units concatenated).
        # Collect their branch lengths (27 per tree, i.e. 2*15-3 for an unrooted binary tree with 15 tips) and their diameter (the largest patristic distance between two tips).
        for(mytree in 241:length(all_units_raxml_trees)) {
            collect_branch_len_supermatrix_ml = c(collect_branch_len_supermatrix_ml, all_units_raxml_trees[[mytree]]$edge.length)
            collect_diameter_supermatrix_ml = c(collect_diameter_supermatrix_ml, max(distTips(all_units_raxml_trees[[mytree]], method = 'patristic')))
        }

        # Get all units trees (250) ML + ST
        all_raxml_superfine_file  = file.path(m, "results/rerun", "all_units_RAxML_superfine_trees.nwk")
        all_raxml_superfine_trees = read.tree(all_raxml_superfine_file, keep.multi=TRUE)
        all_raxml_superfine_trees = unroot(all_raxml_superfine_trees)
        if (length(which(is.binary(all_raxml_superfine_trees) == FALSE)) != 0) {
            for (tree in which(is.binary(all_raxml_superfine_trees) == FALSE)) {
                all_raxml_superfine_trees[[tree]] = multi2di(all_raxml_superfine_trees[[tree]])
            }
        }

        # Reference tree block. Each gene family has its own reference tree, the one generated by the simulation step. All sequences are generated along this tree, then aligned and passed as input to the main.nf pipeline.
        ref_tree_file = file.path(run, paste0(basename(m), ".nwk"))
        ref_tree      = read.tree(ref_tree_file, keep.multi=FALSE)

        # Each gene family has a predefined gene tree (ref_tree): a fixed species tree is pasted at every leaf of the paralog tree. The tree is designed like this by choice. Gene families differ only in their paralog tree, which is unique to each family (15 tips).
        # Here we extract the paralog tree. To do so we only need the subtree with all the sequences of one species: every species has the same paralog tree, so it does not matter which species we pick.
        ref_subtree   = keep.tip(ref_tree,as.character(ref_codefile$V1))

        # We then rename the tips of the paralog tree (ref_subtree) to a more generic key name, effectively going from rat1 to seq1.
        for (j in 1:length(ref_subtree$tip.label)) { 
            ref_subtree$tip.label[j] = as.character(ref_codefile$V2[which(ref_codefile$V1 == ref_subtree[["tip.label"]][j])])
        }

        # Unroot the paralog tree, wrap it in a one-element multiPhylo object, and save it to the specific gene family directory
        ref_subtree        = unroot(ref_subtree)
        ref_subtree        = list(ref_subtree)
        class(ref_subtree) = "multiPhylo"
        subtree_out_file   = file.path(m, paste0("ref_subtree_", basename(m), ".nwk"))
        write.tree(ref_subtree, file=subtree_out_file)

        # append the mean branch length of the true/reference paralog tree of the given gene family
        avg_branch_length = c(avg_branch_length, mean(ref_subtree[[1]]$edge.length))

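        # Make a list containing all the trees gathered so far and compute all pairwise comparisons among them. With the full tree sets (commented line below) this is a square matrix of size 1001 (1 reference + 4*250 trees) with 0 on the diagonal, i.e. 1001*1001-1001 off-diagonal entries, half of which are unique.
        # The active line keeps only every 10th tree of each set (25 per set when each file holds 250 trees), which gives a smaller matrix (size 101 in that case).
        # All other measurements are done on this matrix (plotdata). It is a symmetric distance matrix.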
        #all_trees        = c(ref_subtree, all_units_trees, all_superfine_trees, all_units_raxml_trees, all_raxml_superfine_trees)
        all_trees        = c(ref_subtree, all_units_trees[seq(1, 250, 10)], all_superfine_trees[seq(1, 250, 10)], all_units_raxml_trees[seq(1, 250, 10)], all_raxml_superfine_trees[seq(1, 250, 10)])
        class(all_trees) = "multiPhylo"
        all_vs_all_rf    = RF.dist(all_trees,normalize=TRUE)
        plotdata         = data.matrix(all_vs_all_rf)

        # all ME metrics to be computed
        mean_rf           = c()
        se_rf             = c()
        mean_rf_norm      = c()
        mean_superfine_rf = c()
        se_superfine_rf   = c()

        # all ML metrics to be computed
        mean_raxml_rf           = c()
        se_raxml_rf             = c()
        mean_raxml_rf_norm      = c()
        mean_raxml_superfine_rf = c()
        se_raxml_superfine_rf   = c()

        # Walk through the first row of the plotdata matrix starting at column 2 and taking one column every ten, up to 242; in other words, proceed in chunks of 10 (samples). For each chunk compute the mean and the standard error.
        # This averages all sample values within a unit, going from 250 values to 25 averages.
        # This loop only covers the comparisons between the reference paralog tree (ref_subtree, first row/column of the matrix) and the 250 paralog trees of ME + SM (all_units_trees).
        for (i in seq(2, (length(orgs$V1)*10)+1, by=10)) {
            mean_rf = c(mean_rf, mean(plotdata[1, i:(i+9)]))
            se_rf   = c(se_rf, (sd(plotdata[1, i:(i+9)])/sqrt(10)))
        }

        # This loop instead goes from column 252 to 492 of the plotdata matrix, still in chunks of 10 (samples).
        # It does the same as above, but between the reference paralog tree and the 250 paralog trees of ME + ST (all_superfine_trees).
        for (i in seq( (length(orgs$V1)*10)+2, (2*(length(orgs$V1)*10)+1), by=10 )) {
            mean_superfine_rf = c(mean_superfine_rf, mean(plotdata[1,i:(i+9)]))
            se_superfine_rf   = c(se_superfine_rf, (sd(plotdata[1,i:(i+9)])/sqrt(10)))
        }

        # This loop goes from column 502 to 742, still in chunks of 10 (samples).
        # It does the same as above, but between the reference paralog tree and the 250 paralog trees of ML + SM (all_units_raxml_trees).
        for (i in seq( (2*(length(orgs$V1)*10))+2, (3*(length(orgs$V1)*10)+1) ,by=10 )) {
            mean_raxml_rf = c(mean_raxml_rf, mean(plotdata[1,i:(i+9)]))
            se_raxml_rf   = c(se_raxml_rf, (sd(plotdata[1,i:(i+9)])/sqrt(10)))
        }

        # This loop goes from column 752 to 992, still in chunks of 10 (samples).
        # It does the same as above, but between the reference paralog tree and the 250 paralog trees of ML + ST (all_raxml_superfine_trees).
        for (i in seq( (3*(length(orgs$V1)*10))+2, (4*(length(orgs$V1)*10)+1), by=10)) {
            mean_raxml_superfine_rf = c(mean_raxml_superfine_rf, mean(plotdata[1,i:(i+9)]))
            se_raxml_superfine_rf   = c(se_raxml_superfine_rf, (sd(plotdata[1,i:(i+9)])/sqrt(10)))
        }

        
    }
}

[1] "results"
[1] "avg_branchlen_0.7_protein_family_001.sub"


ERROR: Error in plotdata[1, i:(i + 9)]: subscript out of bounds
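
The `subscript out of bounds` error is consistent with a mismatch between the size of `plotdata` and the column ranges used by the four averaging loops: the loops assume 10 columns per organism and method, whereas the active `all_trees` line keeps only every 10th tree. The sketch below reproduces that arithmetic in base R, under the assumption that each tree file holds 250 trees and that `orgs` has 25 rows. The function names are illustrative and do not come from the notebook.

```r
# Number of columns of the RF matrix: 1 reference tree + the trees of each method
n_matrix_cols = function(trees_per_method, n_methods = 4) {
    1 + n_methods * trees_per_method
}

# Largest column index touched by the averaging loops, which read
# plotdata[1, i:(i + 9)] for i up to the last chunk start of the last method
max_col_needed = function(n_orgs, n_reps = 10, n_methods = 4) {
    last_start = max(seq((n_methods - 1) * n_orgs * n_reps + 2,
                         n_methods * n_orgs * n_reps + 1, by = n_reps))
    last_start + n_reps - 1
}

full_cols       = n_matrix_cols(250)                        # all 250 trees per method
subsampled_cols = n_matrix_cols(length(seq(1, 250, 10)))    # every 10th tree
needed          = max_col_needed(25)

cat("full:", full_cols, "subsampled:", subsampled_cols, "needed:", needed, "\n")
```

Under these assumptions the loops need 1001 columns while the subsampled matrix has only 101, so the cell would need either the full tree sets or loop bounds derived from `ncol(plotdata)`.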


In [72]:
plotdata

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ⋯ | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.3333333 | 0.41666667 | 0.33333333 | 0.25 | 0.25 | 0.33333333 | 0.33333333 | 0.25 | 0.33333333 | ⋯ | 0.5 | 0.25 | 0.25 | 0.25 | 0.33333333 | 0.25 | 0.33333333 | 0.16666667 | 0.33333333 | 0.16666667 |
| 2 | 0.3333333 | 0.0 | 0.41666667 | 0.33333333 | 0.25 | 0.25 | 0.33333333 | 0.33333333 | 0.25 | 0.33333333 | ⋯ | 0.3333333 | 0.25 | 0.33333333 | 0.33333333 | 0.41666667 | 0.25 | 0.41666667 | 0.16666667 | 0.41666667 | 0.16666667 |
| 3 | 0.4166667 | 0.4166667 | 0.0 | 0.08333333 | 0.16666667 | 0.16666667 | 0.08333333 | 0.08333333 | 0.16666667 | 0.08333333 | ⋯ | 0.4166667 | 0.16666667 | 0.33333333 | 0.25 | 0.33333333 | 0.16666667 | 0.41666667 | 0.25 | 0.41666667 | 0.25 |
| 4 | 0.3333333 | 0.3333333 | 0.08333333 | 0.0 | 0.08333333 | 0.08333333 | 0.0 | 0.0 | 0.08333333 | 0.0 | ⋯ | 0.4166667 | 0.08333333 | 0.25 | 0.16666667 | 0.25 | 0.08333333 | 0.33333333 | 0.16666667 | 0.33333333 | 0.16666667 |
| 5 | 0.25 | 0.25 | 0.16666667 | 0.08333333 | 0.0 | 0.0 | 0.08333333 | 0.08333333 | 0.0 | 0.08333333 | ⋯ | 0.4166667 | 0.0 | 0.16666667 | 0.08333333 | 0.16666667 | 0.0 | 0.25 | 0.08333333 | 0.25 | 0.08333333 |
| 6 | 0.25 | 0.25 | 0.16666667 | 0.08333333 | 0.0 | 0.0 | 0.08333333 | 0.08333333 | 0.0 | 0.08333333 | ⋯ | 0.4166667 | 0.0 | 0.16666667 | 0.08333333 | 0.16666667 | 0.0 | 0.25 | 0.08333333 | 0.25 | 0.08333333 |
| 7 | 0.3333333 | 0.3333333 | 0.08333333 | 0.0 | 0.08333333 | 0.08333333 | 0.0 | 0.0 | 0.08333333 | 0.0 | ⋯ | 0.4166667 | 0.08333333 | 0.25 | 0.16666667 | 0.25 | 0.08333333 | 0.33333333 | 0.16666667 | 0.33333333 | 0.16666667 |
| 8 | 0.3333333 | 0.3333333 | 0.08333333 | 0.0 | 0.08333333 | 0.08333333 | 0.0 | 0.0 | 0.08333333 | 0.0 | ⋯ | 0.4166667 | 0.08333333 | 0.25 | 0.16666667 | 0.25 | 0.08333333 | 0.33333333 | 0.16666667 | 0.33333333 | 0.16666667 |
| 9 | 0.25 | 0.25 | 0.16666667 | 0.08333333 | 0.0 | 0.0 | 0.08333333 | 0.08333333 | 0.0 | 0.08333333 | ⋯ | 0.4166667 | 0.0 | 0.16666667 | 0.08333333 | 0.16666667 | 0.0 | 0.25 | 0.08333333 | 0.25 | 0.08333333 |
| 10 | 0.3333333 | 0.3333333 | 0.08333333 | 0.0 | 0.08333333 | 0.08333333 | 0.0 | 0.0 | 0.08333333 | 0.0 | ⋯ | 0.4166667 | 0.08333333 | 0.25 | 0.16666667 | 0.25 | 0.08333333 | 0.33333333 | 0.16666667 | 0.33333333 | 0.16666667 |
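
The entries of this matrix are normalized Robinson-Foulds distances (`RF.dist(..., normalize=TRUE)`). As a reading aid, and assuming unrooted, fully resolved trees with $n = 15$ tips as described in the comments of the analysis cell, the normalization can be written as:

```latex
\mathrm{RF}_{\mathrm{norm}}(T_1, T_2) = \frac{\mathrm{RF}(T_1, T_2)}{2\,(n - 3)},
\qquad n = 15 \;\Rightarrow\; 2\,(n - 3) = 24
```

Two binary trees on the same tips each have $n - 3$ internal splits, so their RF distance is an even number and every entry is a multiple of $2/24 \approx 0.0833$. This agrees with the values printed above, for example $0.41666667 = 10/24$.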


In [76]:
seq(2, (length(orgs$V1)*10)+1, by=10)

In [77]:
(length(orgs$V1)*10)+1

In [78]:
orgs$V1

In [80]:
plotdata[1, 2:(2+9)]
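
The slice `plotdata[1, 2:(2+9)]` is the first chunk of ten replicates that the averaging loops summarise. One compact way to express the per-chunk mean and standard error is sketched below on made-up numbers; the vector `rf_row` and the helper `chunk_stats()` are illustrative and not part of the notebook.

```r
# Illustrative helper: mean and standard error over consecutive chunks of
# `size` values (one chunk = the replicates of one unit)
chunk_stats = function(x, size = 10) {
    stopifnot(length(x) %% size == 0)
    chunks = matrix(x, nrow = size)          # one column per chunk
    data.frame(
        mean = colMeans(chunks),
        se   = apply(chunks, 2, sd) / sqrt(size)
    )
}

# Made-up distances: two units with 10 replicates each
rf_row = c(rep(0.25, 10), rep(c(0, 0.5), 5))
stats  = chunk_stats(rf_row)
print(stats)
```

Applied to the real data, `x` would be the part of the first row of `plotdata` that belongs to one method, provided its length is a multiple of the chunk size.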

In [6]:
seq( (length(orgs$V1)*10)+2, (2*(length(orgs$V1)*10)+1), by=10 )

In [7]:
seq( (2*(length(orgs$V1)*10))+2, (3*(length(orgs$V1)*10)+1) ,by=10 )

In [8]:
seq( (3*(length(orgs$V1)*10))+2, (4*(length(orgs$V1)*10)+1), by=10)
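
The three `seq()` calls above, together with the one in `In [76]`, follow a single pattern: method number `k` (1 = ME + SM, 2 = ME + ST, 3 = ML + SM, 4 = ML + ST) starts at column `(k - 1) * n_orgs * 10 + 2`, because column 1 holds the reference tree. A hedged sketch of that pattern as one function follows; `chunk_starts()` and its argument names are introduced here for illustration only.

```r
# Illustrative helper: first column of every 10-replicate chunk for method k,
# given that column 1 of the matrix is the reference tree
chunk_starts = function(k, n_orgs, n_reps = 10) {
    block = n_orgs * n_reps                  # columns per method
    seq((k - 1) * block + 2, k * block + 1, by = n_reps)
}

# With 25 organisms, as assumed in the comments of the analysis cell
starts = lapply(1:4, chunk_starts, n_orgs = 25)
sapply(starts, range)
```

With 25 organisms the first and last chunk starts per method are 2 and 242, 252 and 492, 502 and 742, and 752 and 992, which are the ranges quoted in the loop comments.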