# Analysis of mouse empirical dataset

TODO add here a brief description of your project or analysis. Explain the purpose, goals, or context of the notebook here.

## Table of Contents
- [1. Chapter 1 - Introduction](#chapter-1-introduction)
- [2. Chapter 2 - R packages](#chapter-2-R-packages)
- [3. Chapter 3 - Input specification](#chapter-3-Input-specification)
- [4. Chapter 4 - RF distance summaries](#chapter-4-RF-distance-summaries)

<a id="chapter-1-introduction"></a>

## Chapter 1 - Introduction

TODO add description of notebook and how to use it.


<a id="chapter-2-R-packages"></a>

## Chapter 2 - R packages

Here is the list of packages needed for this notebook.

In [1]:
library("phangorn")
library("rlist")
library("reshape2")
library("ggplot2")
library("geiger")
library("adephylo")
library("phytools")
library("hrbrthemes")
#library("ggtree")
#library("pheatmap")
library("gplots")
library("ape")
library("plotrix")
library("plyr")
library("dplyr")
library("rstatix")
library("grid")
library("gridExtra")
library("ggforce")
library("viridis")

Loading required package: ape

Loading required package: phytools

Loading required package: maps

Loading required package: ade4


Attaching package: ‘gplots’


The following object is masked from ‘package:stats’:

    lowess



Attaching package: ‘plotrix’


The following object is masked from ‘package:gplots’:

    plotCI


The following object is masked from ‘package:phytools’:

    rescale



Attaching package: ‘plyr’


The following object is masked from ‘package:maps’:

    ozone



Attaching package: ‘dplyr’


The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize


The following object is masked from ‘package:ape’:

    where


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘rstatix’


The following objects are masked from ‘package:plyr’:

    desc, mutate

<a id="chapter-3-Input-specification"></a>

## Chapter 3 - Input specification

Set the `parent_of_all_dir` variable to the location of the input; all input files are expected to be under it. Alternatively, change each individual input location to your liking.<br>
All input files or directory paths are listed here.

In [2]:
# the location of the directory that contains all inputs for analyses 
parent_of_all_dir = "/users/cn/avignoli/1_paralogs/MOUSE_above_40_entries_clean_run/rerun_with_struct_concat/"

# The paths to all PFAM families directories
pfam_dirs_paths = paste0(parent_of_all_dir, "PF*")

# here we get all PFAM directories paths into an iterable used later on in the analysis
pfam_dirs = Sys.glob(pfam_dirs_paths)
pfam_dirs = pfam_dirs[file.info(pfam_dirs)$isdir]

# Create the 'figures' directory if it doesn't exist. Created where this notebook is executed.
if (!dir.exists("figures")) {
    dir.create("figures")
}

# Create the 'tables' directory if it doesn't exist. Created where this notebook is executed.
if (!dir.exists("tables")) {
    dir.create("tables")
}
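
Before running the later chapters it can help to confirm that each family directory actually contains the files the analysis reads. The sketch below is only an illustration and not part of the original pipeline: `missing_inputs`, `demo_family` and `demo_prefix` are hypothetical names, and the file list simply mirrors the paths built in Chapter 4.

```r
# Hypothetical helper (not part of the original notebook): return the expected
# input files that are missing from one family directory.
missing_inputs = function(family, seq_prefix) {
    fam = basename(family)
    expected = c(
        file.path(family, seq_prefix, "full_aln.code_name"),
        file.path(family, seq_prefix, "file.code_name"),
        file.path(family, seq_prefix, paste0(fam, ".orthologs_org_ids_to_concatenate")),
        file.path(family, paste0(fam, ".intersecting_genes")),
        file.path(family, seq_prefix, "all_units.nwk"),
        file.path(family, seq_prefix, "all_units_superfine_trees.nwk")
    )
    expected[!file.exists(expected)]
}

# Toy demonstration in a temporary directory that contains only one expected file
demo_family = file.path(tempdir(), "PF00000")
demo_prefix = "results"
dir.create(file.path(demo_family, demo_prefix), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_family, demo_prefix, "file.code_name"))

demo_missing = missing_inputs(demo_family, demo_prefix)
print(basename(demo_missing))
```

In the notebook itself, something like `lapply(pfam_dirs, missing_inputs, seq_prefix = seq_prefix)` would report the gaps per family once `seq_prefix` has been defined.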

<a id="chapter-4-RF-distance-summaries"></a>

## Chapter 4 - RF distance summaries

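This chapter computes normalized Robinson-Foulds (RF) distances between the paralog trees built from concatenated units (supermatrix and supertree, both with Minimum Evolution) and the reference paralog trees extracted, one per species, from the ML tree of the full alignment. The distances are averaged over the 10 samples of each unit and then over the 7 reference trees of each gene family.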
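
As a small illustration of the metric used in this chapter, the sketch below calls `RF.dist` from phangorn on three toy trees. The trees and their tip names are invented for the example; only the call pattern (a `multiPhylo` object with `normalize=TRUE`, then conversion to a matrix) follows the cell below.

```r
library("ape")
library("phangorn")

# Three toy 5-tip trees (invented for illustration):
# t1 and t2 share one of their two internal splits, t3 is identical to t1
t1 = read.tree(text = "((A,B),(C,(D,E)));")
t2 = read.tree(text = "((A,C),(B,(D,E)));")
t3 = read.tree(text = "((A,B),(C,(D,E)));")

# Combine the trees into a multiPhylo object, as the notebook does
toy_trees = c(t1, t2, t3)
class(toy_trees) = "multiPhylo"

# All-against-all normalized RF distances, converted to a square matrix
toy_rf = as.matrix(RF.dist(toy_trees, normalize = TRUE))
print(toy_rf)
```

Identical topologies give a distance of 0, and larger values mean fewer shared splits, which is why only topology (not branch length) matters in the comparisons that follow.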

In [12]:
#
# Compute RF distances between supermatrix and supertree trees and ML trees from the full alignment of all the seqs
# without Mouse MSAs and structure-based MSAs

# these vectors hold the mean and sd of the RF distances over the 10 samples of each unit vs the reference gene family paralog trees.
# there are 7 reference trees per family, because there are 7 species and each of them might yield a different paralog tree.
# for this reason the unit sample trees are compared to all 7 reference trees.
mean_rf_me_seq_vs_ml_species_full_ref = c()
sd_rf_me_seq_vs_ml_species_full_ref   = c()

# create a dataframe with miscellaneous information about the gene family
df_fam_info = data.frame(
    family_name  = c(),
    num_seq      = c(),
    num_paralogs = c(),
    organism     = c()
)


# the directory prefix where all analyzed files are. this is created by the main.nf pipeline in empirical data mode.
seq_prefix = "results_without_mouse_10_samples/tcoffee/"

# iterate over all family specific output directories
for (family in pfam_dirs) {

    print(basename(family))

    # load the BigTree MSA sequence names mapping scheme, e.g. tip 1 is pig_genename.
    codefile_full_aln_path = file.path(family, seq_prefix, "full_aln.code_name")
    codefile_full_aln      = read.table(codefile_full_aln_path)

    # load only the mouse MSA sequence names mapping scheme.
    codefile_path = file.path(family, seq_prefix, "file.code_name")
    codefile      = read.table(codefile_path)

    # get the codename/acronym for the organism/species in the gene family. 
    orgs_path = file.path(family, seq_prefix, paste0(basename(family), ".orthologs_org_ids_to_concatenate"))
    orgs      = read.table(orgs_path)

    # read the paralog gene names 
    paralogs_file_path = file.path(family, paste0(basename(family), ".intersecting_genes"))
    paralogs_file      = read.table(paralogs_file_path)

    # count the sequences in the whole family, the units (species other than mouse, so 7 - 1 = 6) and the paralogs per species.
    # these values are related: num_species * paralogs_per_species = sequences_in_family
    num_seq_in_fam = length(codefile_full_aln[[1]])
    num_of_units   = length(orgs$V1) - 1
    num_paralogs   = length(paralogs_file$V1)

    # compile the dataframe with family name and organism present
    df_fam_info = data.frame(
        family_name  = c(df_fam_info$family_name, c(rep(basename(family), times=7))),       # repeat to match the size of organism names
        num_seq      = c(df_fam_info$num_seq, c(rep(num_seq_in_fam, times=7))),
        num_paralogs = c(df_fam_info$num_paralogs, c(rep(num_paralogs, times=7))),
        organism     = c(df_fam_info$organism, orgs[[1]])
    )

    # Get all unit trees (60 = 6 organisms/units (7 - mouse) * 10 samples/statistical repeats) for the combination Minimum Evolution (ME) + supermatrix method (SM).
    # resolve them into binary trees with the multi2di function from ape
    all_units_trees_seq_path = file.path(family, seq_prefix, "all_units.nwk")
    all_units_trees_seq      = read.tree(all_units_trees_seq_path, keep.multi=TRUE)
    all_units_trees_seq      = multi2di(all_units_trees_seq)

    # do the same as above for Maximum Likelihood (ML) + SM 
    all_units_raxml_trees_seq_path = file.path(family, seq_prefix, "all_units_RAxML_trees.nwk")
    all_units_raxml_trees_seq      = read.tree(all_units_raxml_trees_seq_path, keep.multi=TRUE)
    all_units_raxml_trees_seq      = multi2di(all_units_raxml_trees_seq)

    # do the same as above for ME + Super Tree (ST) method
    all_units_superfine_trees_seq_path = file.path(family, seq_prefix, "all_units_superfine_trees.nwk")
    all_units_superfine_trees_seq      = read.tree(all_units_superfine_trees_seq_path, keep.multi=TRUE)
    all_units_superfine_trees_seq      = multi2di(all_units_superfine_trees_seq)

    # do the same as above for ML + ST
    all_units_raxml_superfine_trees_seq_path = file.path(family, seq_prefix, "all_units_RAxML_superfine_trees.nwk")
    all_units_raxml_superfine_trees_seq      = read.tree(all_units_raxml_superfine_trees_seq_path, keep.multi=TRUE)
    all_units_raxml_superfine_trees_seq      = multi2di(all_units_raxml_superfine_trees_seq)

    # it will store all the species reference paralog trees (7)
    all_species_raxml_subtrees_from_full_aln = c()
    
    # read the BigTree ML 
    species_raxml_tree_from_full_aln_path = file.path(family, seq_prefix, paste0("RAxML_bestTree.", basename(family), ".domain_sequences_prior_after_intersection_full_coded_raxml.nwk"))
    species_raxml_tree_from_full_aln = read.tree(species_raxml_tree_from_full_aln_path, keep.multi=FALSE)
    
    # rename each tip in the above tree according to the mapping code file of the ML BigTree
    for (j in 1:length(species_raxml_tree_from_full_aln$tip.label)) { 
        species_raxml_tree_from_full_aln$tip.label[j] = as.character(codefile_full_aln$V1[which(codefile_full_aln$V2 == species_raxml_tree_from_full_aln[["tip.label"]][j])])
    }
    
    # do a for loop on the species (7) 
    for (species in orgs$V1) {
    
        # get the list of indices referring to the organism in question
        tips_to_keep = grep(species,species_raxml_tree_from_full_aln$tip.label)
    
        # extract the subtree (paralog tree) from the BigTree ML relating to the species in question
        test_subtree = keep.tip(species_raxml_tree_from_full_aln,tips_to_keep)
    
        # rename each tip, substituting the specific species codename with MOUSE, effectively going from ACAP2_AILME_1 to ACAP2_MOUSE_1
        # this makes the tip names of all species subtrees comparable, although they refer to different sequences.
        # this is fine since only the correctness of the tree topology is tested later on.
        test_subtree$tip.label = gsub(paste0("_", species, "_"), "_MOUSE_", test_subtree$tip.label)
    
        # the tips are further mapped to the C1, C2, C3, etc. nomenclature, going from ACAP2_MOUSE_1 to C2
        for (j in 1:length(test_subtree$tip.label)) { 
            test_subtree$tip.label[j] = as.character(codefile$V2[which(codefile$V1 == test_subtree[["tip.label"]][j])])
        }
        
        # transform the subtree into the ape multiPhylo class and add it to the list of all reference paralog trees.
        test_subtree = as.multiPhylo(test_subtree)
        all_species_raxml_subtrees_from_full_aln = c(all_species_raxml_subtrees_from_full_aln, test_subtree)
    }

    # set the class of the vector to the ape multiPhylo class
    class(all_species_raxml_subtrees_from_full_aln) = "multiPhylo"

    # combine the reference trees, the ME + SM unit trees and the ME + ST unit trees. The last two sets are the paralog trees whose topology is tested for correctness.
    # the class is then set again. There are 127 trees in the vector: the first 7 are the species references, the others are 60 + 60 (see above).
    me_seq_vs_ml_species_full_ref        = c(all_species_raxml_subtrees_from_full_aln, all_units_trees_seq, all_units_superfine_trees_seq)
    class(me_seq_vs_ml_species_full_ref) = "multiPhylo"
    
    # compute the all-against-all Robinson-Foulds distance, a measure of topological difference, and turn it into a 127 * 127 matrix
    rf_me_seq_vs_ml_species_full_ref      = RF.dist(me_seq_vs_ml_species_full_ref, normalize=TRUE)
    rf_me_seq_vs_ml_species_full_ref_data = data.matrix(rf_me_seq_vs_ml_species_full_ref)

    # the 127*127 matrix has a lot of redundant and irrelevant comparisons
    # parse the data matrix created above to keep only the necessary RF distance values:
    # (SM + ME, ST + ME) comparisons with the 7 reference trees of the family.
    # the for loop runs 7 times.
    for (line in 1:length(orgs$V1)) {
    
        # take the nth of the first 7 rows of the matrix, dropping its first 7 elements,
        # i.e. all comparisons of the ME + SM and ME + ST unit trees with the nth reference tree.
        selected_line = rf_me_seq_vs_ml_species_full_ref_data[line, -c(1:length(orgs$V1))]
    
        # 6 units * 10 samples/replicates * 2 methods (SM + ME, ST + ME). this for loop gathers the RF values of the 10 samples of the same unit/species,
        # so that all trees from the samples/replicates of one unit are summarised together.
        # the for loop runs 12 times (2 methods * 6 units)
        for (i in seq(1, (num_of_units*10*2), by=10)) {
    
            # compute mean and sd across the 10 sample values and update the relevant lists.
            mean_rf_me_seq_vs_ml_species_full_ref = c(mean_rf_me_seq_vs_ml_species_full_ref, mean(selected_line[i:(i+9)]))
            sd_rf_me_seq_vs_ml_species_full_ref   = c(sd_rf_me_seq_vs_ml_species_full_ref, sd(selected_line[i:(i+9)]))
        }
    }
}

[1] "PF00004"
[1] "PF00019"
[1] "PF00022"
[1] "PF00025"
[1] "PF00041"
[1] "PF00059"
[1] "PF00061"
[1] "PF00092"
[1] "PF00102"
[1] "PF00104"
[1] "PF00106"
[1] "PF00122"
[1] "PF00134"
[1] "PF00167"
[1] "PF00168"
[1] "PF00178"
[1] "PF00179"
[1] "PF00250"
[1] "PF00307"
[1] "PF00373"
[1] "PF00531"
[1] "PF00566"
[1] "PF00595"
[1] "PF00615"
[1] "PF00620"
[1] "PF00621"
[1] "PF00622"
[1] "PF00640"
[1] "PF00777"
[1] "PF00782"
[1] "PF00787"
[1] "PF00788"
[1] "PF00822"
[1] "PF01392"
[1] "PF01412"
[1] "PF01421"
[1] "PF02214"
[1] "PF02931"
[1] "PF07679"
[1] "PF08205"
[1] "PF12796"
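
The block-wise averaging in the inner loop above can be checked on a toy vector. This sketch uses random numbers in place of real RF distances, and `loop_means`, `block_means` and `block_sds` are illustrative names; it shows that stepping through the vector in blocks of 10 gives the same means as reshaping it into a 10-row matrix.

```r
# Toy stand-in for one row of the RF matrix: 2 methods * 6 units * 10 samples
num_of_units  = 6
set.seed(1)
selected_line = runif(num_of_units * 10 * 2)

# Loop form used in the notebook: one mean per consecutive block of 10 values
loop_means = c()
for (i in seq(1, num_of_units * 10 * 2, by = 10)) {
    loop_means = c(loop_means, mean(selected_line[i:(i + 9)]))
}

# Equivalent reshape: matrix() fills column by column,
# so each column holds one consecutive block of 10 samples
blocks      = matrix(selected_line, nrow = 10)
block_means = colMeans(blocks)
block_sds   = apply(blocks, 2, sd)

print(length(loop_means))
print(all.equal(loop_means, block_means))
```

The reshape avoids growing a vector inside a loop, but it relies on every unit having exactly 10 samples, the same assumption the loop already makes.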


In [29]:
#
# Fig 4 Supermatrix-ME and Supertree-ME VS Big-Tree ML
#

# define how many units there are 
orgs_path      = file.path(family, seq_prefix, paste0(basename(family), ".orthologs_org_ids_to_concatenate"))
orgs           = read.table(orgs_path)
num_of_units   = length(orgs$V1) - 1

# takes the vector of mean RF distances computed in the cell above and reshapes it into a matrix of size 287*12
# 287 is 41 gene families * 7 species. 12 is 2 methods (SM and ST) * 6 units
# so each column is a combination of unit and method: column 1 holds unit1 for SM and column 7 holds unit1 for ST.
# each row is the comparison against one species reference paralog tree of a gene family. each family has 7, one for each species.
df_mean_rf_me_seq_vs_ml_species_full_ref = as.data.frame(matrix(mean_rf_me_seq_vs_ml_species_full_ref, ncol = 2*num_of_units,  byrow = TRUE))

# goes through the df by row and, in chunks of 7 rows, averages the values in each column.
# this effectively averages the RF distance to all 7 reference species paralog trees of the same gene family, giving only one value per family.
# the dataframe now has 41 rows and 12 + 1 columns (the extra one is the group column).
df_mean_rf_me_seq_vs_ml_species_full_ref_avg_for_all_species = df_mean_rf_me_seq_vs_ml_species_full_ref %>%
  group_by(group = gl(n()/7, 7)) %>%
  summarise_all(mean, na.rm = TRUE)

# it removes the first column that had values from 1 to 41.
df_mean_rf_me_seq_vs_ml_species_full_ref_avg_for_all_species = df_mean_rf_me_seq_vs_ml_species_full_ref_avg_for_all_species[,-1]
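
The per-family averaging above can be reproduced on toy numbers with base R only. This is a sketch under the same assumption as the cell (exactly 7 consecutive rows per family); `toy`, `family_id` and `per_family_mean` are illustrative names, and `rowsum` is used here as a stand-in for the `group_by` / `summarise_all` pair.

```r
# Toy data: 2 families * 7 reference species (rows), 3 unit/method columns
toy = matrix(1:42, nrow = 14, ncol = 3, byrow = TRUE)

# Family index for each row: 1 repeated 7 times, then 2 repeated 7 times
# (the same grouping that gl(n()/7, 7) produces in the cell above)
family_id = rep(seq_len(nrow(toy) / 7), each = 7)

# rowsum() adds up the rows that share an id; dividing by 7 turns sums into means
per_family_mean = rowsum(toy, family_id) / 7
print(per_family_mean)
```

Each output row is one family and each column keeps its unit/method meaning, so no extra group column has to be dropped afterwards.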



In [30]:
# display the per-family averages (the group column was already dropped in the previous cell)
df_mean_rf_me_seq_vs_ml_species_full_ref_avg_for_all_species

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.6625,0.64375,0.6375,0.625,0.6125,0.625,0.6625,0.65,0.64375,0.625,0.6,0.61875
0.3473684,0.31052632,0.28421053,0.28421053,0.26842105,0.26315789,0.3473684,0.4105263,0.33157895,0.28947368,0.28421053,0.3
0.35,0.29,0.21,0.18,0.21,0.2,0.35,0.32,0.29,0.26,0.24,0.21
0.4533333,0.28,0.30666667,0.22,0.22,0.13333333,0.4533333,0.38,0.33333333,0.29333333,0.28666667,0.3
0.7816327,0.59591837,0.54897959,0.47040816,0.39183673,0.30612245,0.7816327,0.7163265,0.6377551,0.58571429,0.5,0.51122449
0.5461538,0.43846154,0.36153846,0.36923077,0.3,0.30769231,0.5461538,0.4384615,0.36153846,0.36923077,0.33076923,0.32307692
0.53,0.49,0.47,0.44,0.42,0.4,0.53,0.46,0.47,0.44,0.45,0.4
0.3,0.3,0.22857143,0.27142857,0.21428571,0.28571429,0.3,0.2857143,0.25714286,0.27142857,0.24285714,0.18571429
0.6222222,0.63333333,0.6,0.54444444,0.56666667,0.66666667,0.6222222,0.6222222,0.6,0.62222222,0.63333333,0.64444444
0.3117647,0.29411765,0.28235294,0.27058824,0.26470588,0.29411765,0.3117647,0.2764706,0.3,0.26470588,0.27647059,0.28235294


In [28]:
table(df_fam_info$num_seq)


 70  77  84  91  98 105 112 119 126 133 140 147 154 168 175 
 21  28  35  21  28   7  28  35   7   7  21  21  14   7   7 

In [5]:
table(df_fam_info$organism)


AILME BOVIN CANLF CAVPO CHLSB FICAL HORSE HUMAN LOXAF MELGA MOUSE MUSPF MYOLU 
   25    27    19     1    27     3    10    30     9     1    41    15     3 
ORENI ORYLA OTOGA PANTR   PIG RABIT   RAT SARHA SHEEP TAEGU TAKRU XIPMA 
    2     1    19     5    15     2    22     1     5     1     2     1 

In [6]:
table(df_fam_info$num_paralogs)


10 11 12 13 14 15 16 17 18 19 20 21 22 24 25 
21 28 35 21 28  7 28 35  7  7 21 21 14  7  7 

In [22]:
all_species_raxml_subtrees_from_full_aln

[[1]]

Phylogenetic tree with 21 tips and 20 internal nodes.

Tip labels:
  C2, C3, C5, C4, C11, C17, ...

Rooted; includes branch lengths.

[[2]]

Phylogenetic tree with 21 tips and 20 internal nodes.

Tip labels:
  C1, C2, C3, C5, C4, C11, ...

Rooted; includes branch lengths.

[[3]]

Phylogenetic tree with 21 tips and 20 internal nodes.

Tip labels:
  C2, C3, C5, C4, C11, C17, ...

Rooted; includes branch lengths.

[[4]]

Phylogenetic tree with 21 tips and 20 internal nodes.

Tip labels:
  C1, C2, C3, C5, C4, C11, ...

Rooted; includes branch lengths.

[[5]]

Phylogenetic tree with 21 tips and 20 internal nodes.

Tip labels:
  C1, C2, C3, C5, C4, C11, ...

Rooted; includes branch lengths.

[[6]]

Phylogenetic tree with 21 tips and 20 internal nodes.

Tip labels:
  C2, C3, C9, C5, C4, C11, ...

Rooted; includes branch lengths.

[[7]]

Phylogenetic tree with 21 tips and 20 internal nodes.

Tip labels:
  C1, C2, C3, C5, C4, C11, ...

Rooted; includes branch lengths.
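
The C-labels shown above come from a two-step renaming: the species code in each tip name is swapped for MOUSE, and the result is looked up in the code file. The sketch below shows a vectorized version of that lookup with `match`. The mapping table is a toy one: only the ACAP2 / C2 pair appears in the notebook comments, the other rows are hypothetical.

```r
# Toy mapping table laid out like file.code_name: V1 = sequence name, V2 = code.
# Only ACAP2_MOUSE_1 -> C2 is taken from the notebook comments; the rest is invented.
codefile = data.frame(
    V1 = c("ACAP1_MOUSE_1", "ACAP2_MOUSE_1", "ACAP3_MOUSE_1"),
    V2 = c("C1", "C2", "C3"),
    stringsAsFactors = FALSE
)

# Tip labels of a hypothetical AILME subtree
tip_labels = c("ACAP2_AILME_1", "ACAP3_AILME_1", "ACAP1_AILME_1")

# Step 1: replace the species code with MOUSE so names are comparable across species
tip_labels = gsub("_AILME_", "_MOUSE_", tip_labels, fixed = TRUE)

# Step 2: look up every label at once; match() returns NA for unknown names
idx = match(tip_labels, codefile$V1)
stopifnot(!anyNA(idx))
coded = codefile$V2[idx]
print(coded)
```

Compared with a per-tip `which()` loop, `match` makes a missing name visible as `NA` up front instead of failing at the assignment.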


In [21]:
all_units_trees_seq[[1]]


Phylogenetic tree with 21 tips and 20 internal nodes.

Tip labels:
  C6, C20, C10, C9, C18, C13, ...

Rooted; includes branch lengths.

In [10]:
seq(1, (num_of_units*10*2), by=10)

In [28]:
rf_me_seq_vs_ml_species_full_ref_data

tree,1,2,3,4,5,6,7,8,9,10,⋯,118,119,120,121,122,123,124,125,126,127
1,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.3333333,0.0000000,0.8333333,0.8888889,0.8888889,⋯,0.7777778,0.8888889,0.7777778,0.8333333,0.8333333,0.7777778,0.6111111,0.8888889,0.8333333,0.7777778
2,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.3333333,0.0000000,0.8333333,0.8888889,0.8888889,⋯,0.7777778,0.8888889,0.7777778,0.8333333,0.8333333,0.7777778,0.6111111,0.8888889,0.8333333,0.7777778
3,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.3333333,0.0000000,0.8333333,0.8888889,0.8888889,⋯,0.7777778,0.8888889,0.7777778,0.8333333,0.8333333,0.7777778,0.6111111,0.8888889,0.8333333,0.7777778
4,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.3333333,0.0000000,0.8333333,0.8888889,0.8888889,⋯,0.7777778,0.8888889,0.7777778,0.8333333,0.8333333,0.7777778,0.6111111,0.8888889,0.8333333,0.7777778
5,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.3333333,0.0000000,0.8333333,0.8888889,0.8888889,⋯,0.7777778,0.8888889,0.7777778,0.8333333,0.8333333,0.7777778,0.6111111,0.8888889,0.8333333,0.7777778
6,0.3333333,0.3333333,0.3333333,0.3333333,0.3333333,0.0000000,0.3333333,0.8333333,0.8888889,0.8888889,⋯,0.7777778,0.8333333,0.7777778,0.7777778,0.8333333,0.7222222,0.6111111,0.8333333,0.7777778,0.7777778
7,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.3333333,0.0000000,0.8333333,0.8888889,0.8888889,⋯,0.7777778,0.8888889,0.7777778,0.8333333,0.8333333,0.7777778,0.6111111,0.8888889,0.8333333,0.7777778
8,0.8333333,0.8333333,0.8333333,0.8333333,0.8333333,0.8333333,0.8333333,0.0000000,0.7222222,0.7222222,⋯,0.8333333,0.8333333,0.9444444,0.7777778,0.8333333,0.8888889,0.8888889,0.7222222,0.8888889,0.8333333
9,0.8888889,0.8888889,0.8888889,0.8888889,0.8888889,0.8888889,0.8888889,0.7222222,0.0000000,0.7777778,⋯,0.7777778,0.6111111,0.8888889,0.7777778,0.6666667,0.7777778,0.7777778,0.7222222,0.7222222,0.7777778
10,0.8888889,0.8888889,0.8888889,0.8888889,0.8888889,0.8888889,0.8888889,0.7222222,0.7777778,0.0000000,⋯,0.7777778,0.7222222,0.8333333,0.8888889,0.7222222,0.8333333,0.8333333,0.6666667,0.7222222,0.8333333


In [14]:
length(mean_rf_me_seq_vs_ml_species_full_ref)

In [26]:
tmp = c("f_1 s_1 u_1 m_1", "f_1 s_1 u_2 m_1", "f_1 s_1 u_1 m_2", "f_1 s_1 u_2 m_2", 
        "f_1 s_2 u_1 m_1", "f_1 s_2 u_2 m_1", "f_1 s_2 u_1 m_2", "f_1 s_2 u_2 m_2", 
        "f_1 s_3 u_1 m_1", "f_1 s_3 u_2 m_1", "f_1 s_3 u_1 m_2", "f_1 s_3 u_2 m_2", 
        "f_2 s_1 u_1 m_1", "f_2 s_1 u_2 m_1", "f_2 s_1 u_1 m_2", "f_2 s_1 u_2 m_2", 
        "f_2 s_2 u_1 m_1", "f_2 s_2 u_2 m_1", "f_2 s_2 u_1 m_2", "f_2 s_2 u_2 m_2", 
        "f_2 s_3 u_1 m_1", "f_2 s_3 u_2 m_1", "f_2 s_3 u_1 m_2", "f_2 s_3 u_2 m_2"
       )
length(tmp)

as.data.frame(matrix(tmp, ncol=4,  byrow = TRUE))

V1,V2,V3,V4
<chr>,<chr>,<chr>,<chr>
f_1 s_1 u_1 m_1,f_1 s_1 u_2 m_1,f_1 s_1 u_1 m_2,f_1 s_1 u_2 m_2
f_1 s_2 u_1 m_1,f_1 s_2 u_2 m_1,f_1 s_2 u_1 m_2,f_1 s_2 u_2 m_2
f_1 s_3 u_1 m_1,f_1 s_3 u_2 m_1,f_1 s_3 u_1 m_2,f_1 s_3 u_2 m_2
f_2 s_1 u_1 m_1,f_2 s_1 u_2 m_1,f_2 s_1 u_1 m_2,f_2 s_1 u_2 m_2
f_2 s_2 u_1 m_1,f_2 s_2 u_2 m_1,f_2 s_2 u_1 m_2,f_2 s_2 u_2 m_2
f_2 s_3 u_1 m_1,f_2 s_3 u_2 m_1,f_2 s_3 u_1 m_2,f_2 s_3 u_2 m_2


In [27]:
df_mean_rf_me_seq_vs_ml_species_full_ref = as.data.frame(matrix(mean_rf_me_seq_vs_ml_species_full_ref, ncol = 2*num_of_units,  byrow = TRUE))

df_mean_rf_me_seq_vs_ml_species_full_ref_avg_for_all_species = df_mean_rf_me_seq_vs_ml_species_full_ref %>%
  group_by(group = gl(n()/7, 7)) %>%
  summarise_all(mean, na.rm = TRUE)

df_mean_rf_me_seq_vs_ml_species_full_ref_avg_for_all_species

group,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0.6625,0.64375,0.6375,0.625,0.6125,0.625,0.6625,0.65,0.64375,0.625,0.6,0.61875
2,0.3473684,0.31052632,0.28421053,0.28421053,0.26842105,0.26315789,0.3473684,0.4105263,0.33157895,0.28947368,0.28421053,0.3
3,0.35,0.29,0.21,0.18,0.21,0.2,0.35,0.32,0.29,0.26,0.24,0.21
4,0.4533333,0.28,0.30666667,0.22,0.22,0.13333333,0.4533333,0.38,0.33333333,0.29333333,0.28666667,0.3
5,0.7816327,0.59591837,0.54897959,0.47040816,0.39183673,0.30612245,0.7816327,0.7163265,0.6377551,0.58571429,0.5,0.51122449
6,0.5461538,0.43846154,0.36153846,0.36923077,0.3,0.30769231,0.5461538,0.4384615,0.36153846,0.36923077,0.33076923,0.32307692
7,0.53,0.49,0.47,0.44,0.42,0.4,0.53,0.46,0.47,0.44,0.45,0.4
8,0.3,0.3,0.22857143,0.27142857,0.21428571,0.28571429,0.3,0.2857143,0.25714286,0.27142857,0.24285714,0.18571429
9,0.6222222,0.63333333,0.6,0.54444444,0.56666667,0.66666667,0.6222222,0.6222222,0.6,0.62222222,0.63333333,0.64444444
10,0.3117647,0.29411765,0.28235294,0.27058824,0.26470588,0.29411765,0.3117647,0.2764706,0.3,0.26470588,0.27647059,0.28235294
