# Analysis of mouse empirical dataset

TODO add here a brief description of your project or analysis. Explain the purpose, goals, or context of the notebook here.

## Table of Contents
- [1. Chapter 1 - Introduction](#chapter-1-introduction)
- [2. Chapter 2 - R packages](#chapter-2-R-packages)
- [3. Chapter 3 - Input specification](#chapter-3-Input-specification)
- [4. Chapter 4 - RF distance summaries](#chapter-4-RF-distance-summaries)

<a id="chapter-1-introduction"></a>

## Chapter 1 - Introduction

TODO add description of notebook and how to use it.


<a id="chapter-2-R-packages"></a>

## Chapter 2 - R packages

Here is the list of packages needed for this notebook.

In [27]:
library("phangorn")
library("rlist")
library("reshape2")
library("ggplot2")
library("geiger")
library("adephylo")
library("phytools")
library("hrbrthemes")
#library("ggtree")
#library("pheatmap")
library("gplots")
library("ape")
library("plotrix")
library("plyr")
library("dplyr")
library("rstatix")
library("grid")
library("gridExtra")
library("ggforce")
library("viridis")
library("tidyr")


Attaching package: ‘tidyr’


The following object is masked from ‘package:reshape2’:

    smiths




<a id="chapter-3-Input-specification"></a>

## Chapter 3 - Input specification

Change the `parent_of_all_dir` variable to point to the location of the input. All input files are expected to be under it. Otherwise, change the individual input locations to your liking.<br>
All input files or directory paths are listed here.

In [2]:
# the location of the directory that contains all inputs for analyses 
parent_of_all_dir = "/users/cn/avignoli/1_paralogs/MOUSE_above_40_entries_clean_run/rerun_with_struct_concat/"

# The paths to all PFAM families directories
pfam_dirs_paths = paste0(parent_of_all_dir, "PF*")

# here we get all PFAM directories paths into an iterable used later on in the analysis
pfam_dirs = Sys.glob(pfam_dirs_paths)
pfam_dirs = pfam_dirs[file.info(pfam_dirs)$isdir]

# Create the 'figures' directory if it doesn't exist. Created where this notebook is executed.
if (!dir.exists("figures")) {
    dir.create("figures")
}

# Create the 'tables' directory if it doesn't exist. Created where this notebook is executed.
if (!dir.exists("tables")) {
    dir.create("tables")
}
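
As a minimal sketch of how the directory discovery above behaves, the snippet below repeats the same two steps (glob, then keep directories only) on a throwaway layout. The temporary directory and the file name `PF_notes.txt` are hypothetical and only serve the illustration. It builds the pattern with `file.path`, which does not rely on `parent_of_all_dir` ending with a slash the way `paste0` does.

```r
# Hypothetical layout: a temporary parent directory with two PFAM-like
# sub-directories and one stray file that also matches the "PF*" pattern.
parent_dir = file.path(tempdir(), "pfam_glob_demo")
dir.create(file.path(parent_dir, "PF00004"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(parent_dir, "PF00019"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(parent_dir, "PF_notes.txt")))

# Same two steps as in the cell above: expand the glob, then keep directories only.
candidates = Sys.glob(file.path(parent_dir, "PF*"))
demo_dirs  = candidates[file.info(candidates)$isdir]

# The stray file is dropped, only the two family directories remain.
print(basename(demo_dirs))
```

Without the `isdir` filter, any regular file whose name starts with `PF` would be treated as a family directory in the later loop.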

<a id="chapter-4-RF-distance-summaries"></a>

## Chapter 4 - RF distance summaries

TODO add description 
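
The first cell of this chapter compares every unit tree with a reference tree through `RF.dist` and keeps only the first row of the resulting distance matrix. The sketch below shows that indexing step on three toy trees. The Newick strings and the tip labels `C1` to `C5` are made up for the illustration and are not taken from the dataset.

```r
library("ape")
library("phangorn")

# Toy trees with hypothetical tip labels. The first tree plays the role of the
# reference, the other two play the role of unit trees.
toy_trees = read.tree(text = c(
    "((C1,C2),(C3,(C4,C5)));",   # reference
    "((C1,C2),(C3,(C4,C5)));",   # unit tree with the same topology as the reference
    "((C1,C3),(C2,(C4,C5)));"    # unit tree with a different topology
))

# All-against-all normalized RF distances for the whole multiPhylo object.
rf_all = RF.dist(toy_trees, normalize = TRUE)

# Row 1 holds the reference; dropping column 1 removes the reference-vs-itself entry,
# so only the reference-vs-unit comparisons remain, in the order of the unit trees.
rf_to_ref = as.matrix(rf_all)[1, -1]
print(rf_to_ref)
```

The identical topology gives a distance of 0, while the rearranged tree gives a positive value that the normalization bounds by 1.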

In [24]:
#
# Compute RF distances between the supermatrix and supertree trees and the ML trees from the full alignment of all the sequences,
# without Mouse MSAs and structure-based MSAs

# create the dataframe that will hold the RF value of each tree comparison.
# there are (7) reference trees per family in this case, because there are 7 species and each one of them might yield a different paralog tree.
# for this reason the unit_sample trees are compared to all (7) reference trees.
# taken together, all columns (except rf_val) identify a single comparison.
df_all_rf = data.frame(
    method    = c(),
    family    = c(),
    species   = c(),
    unit      = c(),
    sample    = c(),
    rf_val    = c()
)

# create a dataframe with miscellaneous information about the gene family
df_fam_info = data.frame(
    family_name  = c(),
    num_seq      = c(),
    num_paralogs = c(),
    organism     = c()
)


# the directory prefix where all analyzed files are. this is created by the main.nf pipeline in empirical data mode.
seq_prefix = "results_without_mouse_10_samples/tcoffee/"

# iterate over all family specific output directories
for (family in pfam_dirs) {

    print(basename(family))

    # load the BigTree MSA sequence names mapping scheme, i.e. which coded tip is which sequence (e.g. tip 1 is pig_genename).
    codefile_full_aln_path = file.path(family, seq_prefix, "full_aln.code_name")
    codefile_full_aln      = read.table(codefile_full_aln_path)

    # load only the mouse MSA sequence names mapping scheme.
    codefile_path = file.path(family, seq_prefix, "file.code_name")
    codefile      = read.table(codefile_path)

    # get the codename/acronym for the organism/species in the gene family. 
    orgs_path = file.path(family, seq_prefix, paste0(basename(family), ".orthologs_org_ids_to_concatenate"))
    orgs      = read.table(orgs_path)

    # read the paralog gene names 
    paralogs_file_path = file.path(family, paste0(basename(family), ".intersecting_genes"))
    paralogs_file      = read.table(paralogs_file_path)

    # compute the number of sequences in the whole family, the number of units (number of species (7) minus one) and the number of paralogs per species.
    # These values are related: num_species * paralogs_per_species = sequences_in_family
    num_seq_in_fam = length(codefile_full_aln[[1]])
    num_of_units   = length(orgs$V1) - 1
    num_paralogs   = length(paralogs_file$V1)

    # compile the dataframe with family name and organism present
    df_fam_info = data.frame(
        family_name  = c(df_fam_info$family_name, c(rep(basename(family), times=7))),       # repeat to match the size of organism names
        num_seq      = c(df_fam_info$num_seq, c(rep(num_seq_in_fam, times=7))),
        num_paralogs = c(df_fam_info$num_paralogs, c(rep(num_paralogs, times=7))),
        organism     = c(df_fam_info$organism, orgs[[1]])
    )

    # Get all unit trees (60 = 6 organisms/units (7 - mouse) * 10 samples/statistical repeats) for the combination Minimum Evolution (ME) + supermatrix method (SM).
    # resolve multifurcations into binary trees with the multi2di function
    all_units_trees_seq_path = file.path(family, seq_prefix, "all_units.nwk")
    all_units_trees_seq      = read.tree(all_units_trees_seq_path, keep.multi=TRUE)
    all_units_trees_seq      = multi2di(all_units_trees_seq)

    # do the same as above for Maximum Likelihood (ML) + SM 
    all_units_raxml_trees_seq_path = file.path(family, seq_prefix, "all_units_RAxML_trees.nwk")
    all_units_raxml_trees_seq      = read.tree(all_units_raxml_trees_seq_path, keep.multi=TRUE)
    all_units_raxml_trees_seq      = multi2di(all_units_raxml_trees_seq)

    # do the same as above for ME + Super Tree (ST) method
    all_units_superfine_trees_seq_path = file.path(family, seq_prefix, "all_units_superfine_trees.nwk")
    all_units_superfine_trees_seq      = read.tree(all_units_superfine_trees_seq_path, keep.multi=TRUE)
    all_units_superfine_trees_seq      = multi2di(all_units_superfine_trees_seq)

    # do the same as above for ML + ST
    all_units_raxml_superfine_trees_seq_path = file.path(family, seq_prefix, "all_units_RAxML_superfine_trees.nwk")
    all_units_raxml_superfine_trees_seq      = read.tree(all_units_raxml_superfine_trees_seq_path, keep.multi=TRUE)
    all_units_raxml_superfine_trees_seq      = multi2di(all_units_raxml_superfine_trees_seq)

    # stores all the species reference paralog trees (7)
    all_species_raxml_subtrees_from_full_aln = c()
    
    # read the BigTree ML 
    species_raxml_tree_from_full_aln_path = file.path(family, seq_prefix, paste0("RAxML_bestTree.", basename(family), ".domain_sequences_prior_after_intersection_full_coded_raxml.nwk"))
    species_raxml_tree_from_full_aln = read.tree(species_raxml_tree_from_full_aln_path, keep.multi=FALSE)
    
    # rename each tip in the above tree according to the mapping coding file for the ML BigTree
    for (j in 1:length(species_raxml_tree_from_full_aln$tip.label)) { 
        species_raxml_tree_from_full_aln$tip.label[j] = as.character(codefile_full_aln$V1[which(codefile_full_aln$V2 == species_raxml_tree_from_full_aln[["tip.label"]][j])])
    }
    
    # do a for loop on the species (7) 
    for (species in orgs$V1) {
    
        # get the list of indices referring to the organism in question
        tips_to_keep = grep(species,species_raxml_tree_from_full_aln$tip.label)
    
        # extract the subtree (paralog tree) from the BigTree ML relating to the species in question
        test_subtree = keep.tip(species_raxml_tree_from_full_aln,tips_to_keep)
    
        # rename each tip, substituting the specific species codename with MOUSE, effectively going from ACAP2_AILME_1 to ACAP2_MOUSE_1
        # this makes the tip names of all species subtrees comparable, although they refer to different sequences.
        # this is fine since only the correctness of the tree topology is tested later on.
        test_subtree$tip.label = gsub(paste0("_", species, "_"), "_MOUSE_", test_subtree$tip.label)
    
        # the tips are further mapped to the C1, C2, C3, etc. nomenclature, e.g. going from ACAP2_MOUSE_1 to C2
        for (j in 1:length(test_subtree$tip.label)) { 
            test_subtree$tip.label[j] = as.character(codefile$V2[which(codefile$V1 == test_subtree[["tip.label"]][j])])
        }
        
        # transform the subtree into the ape multiPhylo class and add it to the list with all reference paralog trees.
        test_subtree = as.multiPhylo(test_subtree)
        all_species_raxml_subtrees_from_full_aln = c(all_species_raxml_subtrees_from_full_aln, test_subtree)
    }

    #
    # SM + ME rf computation block 
    # 
    
    # iterate over all possible species reference trees
    for (r in 1:length(all_species_raxml_subtrees_from_full_aln)) {
    
        # RF.dist needs all the trees to compare in a single list (class multiPhylo); it then automatically does all-against-all comparisons.
        # here we compare all unit_sample (60) family SM + ME trees with each of the family's (7) species paralog reference trees
        trees_to_compare_ME_SM         = c(all_species_raxml_subtrees_from_full_aln[r], all_units_trees_seq)
        class(trees_to_compare_ME_SM)  = "multiPhylo"
        rf_trees_to_compare_ME_SM      = RF.dist(trees_to_compare_ME_SM, normalize=TRUE)
    
        # transform the RF.dist output into a matrix and take the first row without its first element.
        # Basically retrieve only the comparisons between the reference tree and each unit tree
        rf_ME_SM_comparisons                = data.matrix(rf_trees_to_compare_ME_SM)[1, -1]
    
        # for loop over the created vector to correctly update the dataframe.
        # the (60) values in the above vector come from different units and samples in an ordered manner; the for loop is just to "count" correctly.
        # the dataframe is updated in chunks of 10
        for (u in seq(10, 60, by=10)) {
        
            # update the df with all relevant info. since the rf vector is split into chunks of ten, all other values have to be repeated accordingly.
            df_all_rf = data.frame(
                method    = c(df_all_rf$method, rep("SM_ME", 10)),
                family    = c(df_all_rf$family, rep(basename(family), 10)),           # use family name
                species   = c(df_all_rf$species, rep(orgs$V1[r] , 10)),               # for the reference tree get the species which it is from
                unit      = c(df_all_rf$unit, rep((u/10), 10)),
                sample    = c(df_all_rf$sample, c(1, 2, 3, 4, 5, 6 ,7 ,8, 9, 10)),
                rf_val    = c(df_all_rf$rf_val, rf_ME_SM_comparisons[(u-9):u])        # get all sample from the same unit
            )
        }
    }
    
    #
    # ST + ME rf computation block 
    # 
    
    # repeat the previous block for super tree as well
    for (r in 1:length(all_species_raxml_subtrees_from_full_aln)) {
        trees_to_compare_ME_SM         = c(all_species_raxml_subtrees_from_full_aln[r], all_units_superfine_trees_seq)
        class(trees_to_compare_ME_SM)  = "multiPhylo"
        rf_trees_to_compare_ME_SM      = RF.dist(trees_to_compare_ME_SM, normalize=TRUE)
        rf_ME_SM_comparisons                = data.matrix(rf_trees_to_compare_ME_SM)[1, -1]
        for (u in seq(10, 60, by=10)) {
            df_all_rf = data.frame(
                method    = c(df_all_rf$method, rep("ST_ME", 10)),
                family    = c(df_all_rf$family, rep(basename(family), 10)),           # use family name
                species   = c(df_all_rf$species, rep(orgs$V1[r] , 10)),               # for the reference tree get the species which it is from
                unit      = c(df_all_rf$unit, rep((u/10), 10)),
                sample    = c(df_all_rf$sample, c(1, 2, 3, 4, 5, 6 ,7 ,8, 9, 10)),
                rf_val    = c(df_all_rf$rf_val, rf_ME_SM_comparisons[(u-9):u])        # get all sample from the same unit
            )
        }
    }
}

# save the dataframe
write.table(df_all_rf, file = "tables/RF_empirical_set_between_ref_supermatrix_and_supertree.tsv", quote = FALSE, row.names = FALSE)

[1] "PF00004"
[1] "PF00019"
[1] "PF00022"
[1] "PF00025"
[1] "PF00041"
[1] "PF00059"
[1] "PF00061"
[1] "PF00092"
[1] "PF00102"
[1] "PF00104"
[1] "PF00106"
[1] "PF00122"
[1] "PF00134"
[1] "PF00167"
[1] "PF00168"
[1] "PF00178"
[1] "PF00179"
[1] "PF00250"
[1] "PF00307"
[1] "PF00373"
[1] "PF00531"
[1] "PF00566"
[1] "PF00595"
[1] "PF00615"
[1] "PF00620"
[1] "PF00621"
[1] "PF00622"
[1] "PF00640"
[1] "PF00777"
[1] "PF00782"
[1] "PF00787"
[1] "PF00788"
[1] "PF00822"
[1] "PF01392"
[1] "PF01412"
[1] "PF01421"
[1] "PF02214"
[1] "PF02931"
[1] "PF07679"
[1] "PF08205"
[1] "PF12796"
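
The two tip-renaming loops in the cell above look up one label at a time with `which`. As a hedged alternative, the sketch below does the same lookup for all labels at once with `match` and stops early if a label has no entry in the mapping. The mapping table and the labels are hypothetical and only mimic the two-column layout (`V1` = sequence name, `V2` = short code) used by `codefile`.

```r
# Hypothetical mapping table in the same two-column layout as the code-name files:
# V1 = original sequence name, V2 = short code.
codefile_demo = data.frame(
    V1 = c("ACAP2_MOUSE_1", "ACAP2_MOUSE_2", "ACAP2_MOUSE_3"),
    V2 = c("C1", "C2", "C3"),
    stringsAsFactors = FALSE
)

# Tip labels as they might look after the species code was replaced by MOUSE.
tip_labels = c("ACAP2_MOUSE_3", "ACAP2_MOUSE_1", "ACAP2_MOUSE_2")

# match() returns, for every tip label, the row of its entry in the mapping table.
idx = match(tip_labels, codefile_demo$V1)

# Fail early if a label is missing from the mapping instead of producing NA labels.
stopifnot(!anyNA(idx))

new_labels = codefile_demo$V2[idx]
print(new_labels)
# [1] "C3" "C1" "C2"
```

In the notebook this would correspond to assigning `codefile$V2[match(test_subtree$tip.label, codefile$V1)]` to `test_subtree$tip.label`, assuming every label appears exactly once in the mapping.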


In [35]:
#
# Table 2 Supermatrix-ME and Supertree-ME VS Big-Tree ML
#

# define how many units there are (this part still relies on `family` and `seq_prefix` being defined by the previous cell)
orgs_path      = file.path(family, seq_prefix, paste0(basename(family), ".orthologs_org_ids_to_concatenate"))
orgs           = read.table(orgs_path)
num_of_species = length(orgs$V1)
num_of_units   = length(orgs$V1) - 1

# read the table compiled by the previous cell of chapter 4. The summary below only needs this table file, not the objects created by that cell.
df_new = read.table("tables/RF_empirical_set_between_ref_supermatrix_and_supertree.tsv", header=T)

# Group data by method and unit, then calculate mean and standard deviation for all families 
summary_table = df_new %>%
  group_by(method, unit) %>%
  summarise(
    mean_rf_val = mean(rf_val, na.rm = TRUE),
    sd_rf_val = sd(rf_val, na.rm = TRUE)
  ) %>%
  ungroup()

# print the unit rf table and save it
summary_table
write.table(summary_table, file = "tables/RF_empirical_set_per_unit_SM_ST_ME.tsv", quote = FALSE, row.names = FALSE)

[1m[22m`summarise()` has grouped output by 'method'. You can override using the
`.groups` argument.


| method | unit | mean_rf_val | sd_rf_val |
|---|---|---|---|
| `<chr>` | `<int>` | `<dbl>` | `<dbl>` |
| SM_ME | 1 | 0.4813773 | 0.2207796 |
| SM_ME | 2 | 0.4252749 | 0.2183687 |
| SM_ME | 3 | 0.3943477 | 0.2192552 |
| SM_ME | 4 | 0.3688909 | 0.2233997 |
| SM_ME | 5 | 0.3500012 | 0.2273973 |
| SM_ME | 6 | 0.3413522 | 0.2263561 |
| ST_ME | 1 | 0.4813773 | 0.2207796 |
| ST_ME | 2 | 0.4558193 | 0.2256991 |
| ST_ME | 3 | 0.4220801 | 0.2225704 |
| ST_ME | 4 | 0.4013759 | 0.2250145 |
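
The grouping message printed by the summary cell can be avoided by passing `.groups` to `summarise`, as the message itself suggests. A minimal sketch on toy data follows; the RF values are made up and are not results from the dataset.

```r
library("dplyr")

# Toy RF values (hypothetical numbers) with the same columns used by the summary.
toy_rf = data.frame(
    method = rep(c("SM_ME", "ST_ME"), each = 4),
    unit   = rep(c(1, 1, 2, 2), times = 2),
    rf_val = c(0.50, 0.40, 0.30, 0.20, 0.60, 0.40, 0.50, 0.30)
)

# Same grouping and statistics as in the cell above; .groups = "drop" returns an
# ungrouped table directly, so no trailing ungroup() call is needed.
toy_summary = toy_rf %>%
    group_by(method, unit) %>%
    summarise(
        mean_rf_val = mean(rf_val, na.rm = TRUE),
        sd_rf_val   = sd(rf_val, na.rm = TRUE),
        .groups     = "drop"
    )

print(toy_summary)
```

The result has one row per method and unit combination, which is the layout the figure cell reads back from the saved table.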


In [56]:
#
# Fig 4 Supermatrix-ME and Supertree-ME VS Big-Tree ML
#

# read the summary table written by the cell above, so this cell does not depend on that one to run; it just needs the table file.
df_figure4 = read.table("tables/RF_empirical_set_per_unit_SM_ST_ME.tsv", header=T)

# Add prefix "unit_" to the "unit" column
df_figure4$unit = paste0("unit_", df_figure4$unit)

# set limits for plot
options(repr.plot.width = 5, repr.plot.height = 5, repr.plot.res = 300)

# plot the table 
unit_rf = ggplot(data=df_figure4, aes(x=unit, y=mean_rf_val, group=method, color=method)) +
    geom_line() + 
    geom_point() + 
    theme_light() + 
    ylab("Normalized RF") + xlab("Topology") +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1, size=8),
        axis.text.y = element_text(size=8),
        axis.title = element_text(face = "bold"),
        legend.position="right"
        ) + 
    scale_y_continuous(limits = c(0, 0.51)) +
    # Custom legend labels while keeping default colors
    scale_color_discrete(
        name = "Method", 
        labels = c("ST_ME" = "Supertree-ME", "SM_ME" = "Supermatrix-ME") # Custom legend labels
    ) 


ggsave(filename="figures/RF_lineplot_ME_seq_vs_ML_Big_tree.png", plot=unit_rf, dpi = "retina", width = 5, height = 5)
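
One caveat of pasting the `unit_` prefix onto the unit number is that the x axis becomes a character variable, which ggplot2 orders alphabetically by default. With six units this makes no difference, but with ten or more, `unit_10` would be placed before `unit_2`. A minimal sketch of fixing the order with an explicit factor, using hypothetical unit numbers:

```r
# Hypothetical unit numbers, including one with two digits.
units  = c(1, 2, 10)
labels = paste0("unit_", units)

# Build the factor levels from the numeric order, not from the label text.
unit_factor = factor(labels, levels = paste0("unit_", sort(unique(units))))

print(levels(unit_factor))
# [1] "unit_1"  "unit_2"  "unit_10"
```

Applied to the cell above, this would mean converting `df_figure4$unit` to such a factor before plotting, under the assumption that the numeric order is the intended one.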