# Analysis of Mislabeled, swapped data


This Jupyter notebook analyzes the trees obtained from mislabeled (aka swapped) sequences. The sequences come from the simulated dataset. Mislabeling a paralog is equivalent to taking two paralogs and swapping them while keeping their labels (aka gene/paralog IDs) untouched. As a consequence, this generates erroneous ortholog assignments. Because the dataset is simulated, the ortholog assignments are known a priori and serve as ground truth. To simulate a real-world scenario, where ortholog assignments are more confused and error-prone, we decided to swap some sequences to create wrong ortholog sets. 
<br><br>
The swapping happens in two steps for each family in the simulated dataset (500): first, the species in which to do the swapping are selected; then, among the paralogs of those species, the desired number of sequences is swapped around. Both species selection and sequence selection are random. So the swapping combination (10, 9) refers to selecting 10 species in which 9 sequences will be swapped. The 9 sequences are swapped among themselves.
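
The two-step procedure described above can be illustrated with a minimal base-R sketch. It is not the code that produced the swapped dataset: the helper name `swap_labels`, the toy family matrix, and the use of a cyclic shift (so that every selected sequence ends up under a different label) are all assumptions made for illustration.

```r
set.seed(1)

# Toy family (assumed sizes): 25 species x 15 paralogs.
# Each cell holds the true identity of the sequence; the column name is its label.
n_sp  <- 25
n_par <- 15
fam <- outer(seq_len(n_sp), seq_len(n_par),
             function(s, p) paste0("sp", s, "_seq", p))
dimnames(fam) <- list(paste0("sp", seq_len(n_sp)),
                      paste0("seq", seq_len(n_par)))

# Hypothetical helper: pick n_species rows at random, then, within each picked
# row, move the content of n_seqs randomly picked columns among themselves.
# The labels (column names) stay untouched, only the sequences move.
swap_labels <- function(fam, n_species, n_seqs) {
    stopifnot(n_seqs >= 2, n_seqs <= ncol(fam), n_species <= nrow(fam))
    picked_species <- sample(nrow(fam), n_species)
    for (s in picked_species) {
        picked <- sample(ncol(fam), n_seqs)
        # cyclic shift: an assumption that guarantees no sequence stays in place
        fam[s, picked] <- fam[s, c(picked[-1], picked[1])]
    }
    fam
}

# swapping combination (10, 9)
swapped <- swap_labels(fam, n_species = 10, n_seqs = 9)

# number of mislabeled sequences per species
n_changed_per_species <- rowSums(swapped != fam)
print(table(n_changed_per_species))
```

In this sketch, 10 species carry exactly 9 mislabeled sequences each and the remaining 15 species are untouched, while every species still contains the same set of sequences as before.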


## Table of Contents
- [1. Chapter 1 - Introduction](#chapter-1-introduction)
- [2. Chapter 2 - R packages](#chapter-2-R-packages)
- [3. Chapter 3 - Input specification](#chapter-3-Input-specification)
- [4. Chapter 4 - Swapped label vs true topologies](#chapter-4-Swapped-label-vs-true-topologies)

<a id="chapter-1-introduction"></a>

## Chapter 1 - Introduction

TODO add description of notebook and how to use it.


<a id="chapter-2-R-packages"></a>

## Chapter 2 - R packages

Here is the list of packages needed for this notebook.

In [1]:
library("phangorn")
library("ggplot2")
library("reshape2")
library("geiger")
library("adephylo")
library("phytools")
library("rlist")
library("plotrix")
library("plyr")
library("dplyr")
library("hrbrthemes")
library("rstatix")
library("gplots")
library("viridis")
library("gtools")
library("gridExtra")
library("grid")
library("ggforce")
library("scales")
library("pROC")
library("tidyr")
library("stringr")
library("patchwork")
library("ggarchery")

Loading required package: ape

Loading required package: phytools

Loading required package: maps

Loading required package: ade4


Attaching package: ‘plotrix’


The following object is masked from ‘package:phytools’:

    rescale



Attaching package: ‘plyr’


The following object is masked from ‘package:maps’:

    ozone



Attaching package: ‘dplyr’


The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize


The following object is masked from ‘package:ape’:

    where


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘rstatix’


The following objects are masked from ‘package:plyr’:

    desc, mutate


The following object is masked from ‘package:stats’:

    filter



Attaching package: ‘gplots’


The following object is masked from ‘package:plotrix’:

    plotC

<a id="chapter-3-Input-specification"></a>

## Chapter 3 - Input specification

Change the parent_of_all_dir variable to point to the location of the input. All input files are expected to be under it. Otherwise, change the individual input locations to your liking.<br>
All input files or directory paths are listed here.

In [112]:
# the location of the directory that contains all inputs for analyses 
parent_of_all_dir = "/users/cn/avignoli/1_paralogs/simulated_datasets/simulated_sequences/"

# The directory (one or more) that contains all the results of the main.nf pipeline run. It can be a glob path.
dir_containing_all_pipe_results = paste0(parent_of_all_dir, "rerun_25_species_corrected_branches/swapped_data/group*")

# The file containing the rename mapping of the reference paralog tree (generic)
reference_tree_codefile = file.path(parent_of_all_dir, "ref_tree.code")

# get the species/organism list of names used by the main.nf pipeline
species_name_file = file.path(parent_of_all_dir, "orthologs_org_ids_to_concatenate")


# each run of main.nf has one output directory. Here we collect the output directories of one or more runs.
run_dirs = Sys.glob(dir_containing_all_pipe_results)
run_dirs = run_dirs[file.info(run_dirs)$isdir]

# Create the 'figures' directory if it doesn't exist. Created where this notebook is executed.
if (!dir.exists("figures")) {
    dir.create("figures")
}

# Create the 'tables' directory if it doesn't exist. Created where this notebook is executed.
if (!dir.exists("tables")) {
    dir.create("tables")
}
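
Before running the analysis, it can help to check that the expected inputs exist, since a wrong parent directory otherwise only shows up as an empty loop or a read error later on. The following is a minimal sketch under stated assumptions: `check_inputs` is a hypothetical helper that is not part of this notebook, and the demonstration runs on a temporary directory rather than on the real input location.

```r
# Hypothetical helper: returns the paths that are missing.
# 'files' are expected regular files, 'dirs' are expected directories.
check_inputs <- function(files, dirs) {
    if (length(dirs) == 0) {
        warning("no run directory given: did the glob pattern match anything?")
    }
    missing_files <- files[!file.exists(files)]
    missing_dirs  <- dirs[!dir.exists(dirs)]
    c(missing_files, missing_dirs)
}

# Self-contained demonstration on a temporary directory (made-up layout)
demo_root <- file.path(tempdir(), "demo_inputs")
dir.create(file.path(demo_root, "group1"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, "ref_tree.code"))

missing_inputs <- check_inputs(
    files = c(file.path(demo_root, "ref_tree.code"),
              file.path(demo_root, "orthologs_org_ids_to_concatenate")),
    dirs  = Sys.glob(file.path(demo_root, "group*"))
)

# only the species list is reported, because it was never created in the demo
print(basename(missing_inputs))
```

With the variables of this chapter, the call would take `reference_tree_codefile` and `species_name_file` as `files` and `run_dirs` as `dirs`.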

<a id="chapter-4-Swapped-label-vs-true-topologies"></a>

## Chapter 4 - Swapped label vs true topologies

TODO add description 
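
The loop in this chapter indexes the all-against-all distance object by position to pull out the reference-versus-method values. The sketch below shows, on a toy `dist` object built with base R, how such an object is laid out: the lower triangle is stored column by column, so with the reference as the first of 5 trees, elements 1 to 4 are its distances to trees 2 to 5. The distance values are made up, and the assumption is that `RF.dist` on a `multiPhylo` returns an object of class `dist`, as the phangorn documentation describes.

```r
# Toy symmetric distance matrix for 5 trees (made-up values, not real RF distances)
tree_names <- c("ref", "SM_ME", "ST_ME", "SM_ML", "ST_ML")
m <- matrix(0, nrow = 5, ncol = 5, dimnames = list(tree_names, tree_names))
m["ref", -1] <- c(0.1, 0.2, 0.3, 0.4)
m[-1, "ref"] <- c(0.1, 0.2, 0.3, 0.4)

# same class as the object the notebook indexes
d <- as.dist(m)

# positional access: the first 4 elements are tree 1 against trees 2, 3, 4, 5
ref_vs_methods <- setNames(d[1:4], tree_names[-1])

# name-based access gives the same values and is harder to get wrong
ref_vs_methods_by_name <- as.matrix(d)["ref", -1]

print(ref_vs_methods)
print(ref_vs_methods_by_name)
```

Going through `as.matrix()` costs a little more but makes the pairing between method names and values explicit, which can be useful when the order of the trees changes.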

In [113]:
# Rows will be appended to this file at each family iteration later on. Writing an empty table here creates the header
# and overwrites any existing file, so re-running this cell does not keep appending to stale results.
write.table(data.frame(method = character(), group = character(), family = character(), combination = character(), rf_val = numeric()),  
            file = "tables/RF_swapped_set_all_methods_vs_reference.tsv",  
            quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)

for (run in run_dirs) {
    print(basename(run))

    # the mapping used later on to rename species trees according to a more generic sequence name
    ref_codefile = read.table(file = reference_tree_codefile )
    
    # get all the family subdirectories in each main.nf output run dir. 
    # This will contain the 100 normal family directories + all the directories of mislabeled families.
    # There are 90 combinations per family (10 species settings x 9 sequence settings, since a single sequence cannot be swapped), so 9100 dirs in each group.
    fam_dirs = grep("avg_branchlen_0\\.7_protein_family", list.dirs(run, recursive = FALSE), value = TRUE)
    count_fam = -1

    # read the names of the ortholog species.
    orgs = read.table(file = species_name_file)

    # setting global df that will hold family level rf values.
    df_all_rf = data.frame(
        method      = c(),
        group       = c(),
        family      = c(),
        combination = c(),
        rf_val      = c()
    )

    # get all unit25 paralog trees for all combination of methods ME,ML + SM,ST
    for (m in fam_dirs) {
        
        count_fam = count_fam + 1

        # print a progress check to screen every 910 directories, i.e. every 10 families (each family has 1 original + 90 swapped directories)
        if (count_fam %% 910 == 0) {
            print(m)
        }

        # work only on swapped families directories
        if (grepl('swap_', m)) {

            # each directory has two subdirectories in which the data are stored
            tree_dir = paste0(m, "/results/rerun")

            # Read the paralog trees
            # Minimum Evolution (ME) + supermatrix method (SM)
            me_sm_path = list.files(path = tree_dir, pattern = glob2rx("*supermatrix.phylip_fastme_tree.nwk"), full.names = TRUE)
            me_sm_file = file.path(me_sm_path)
            me_sm_tree = read.tree(me_sm_file)
            me_sm_tree = unroot(me_sm_tree)

            # ME + supertree method (ST)
            me_st_path = list.files(path = tree_dir, pattern = glob2rx("avg_branchlen_*_paralog_tree.nwk"), full.names = TRUE)
            me_st_file = file.path(me_st_path)
            me_st_tree = read.tree(me_st_file)
            me_st_tree = unroot(me_st_tree)

            # Maximum Likelihood (ML) + SM
            ml_sm_path = list.files(path = tree_dir, pattern = glob2rx("RAxML_bestTree*_supermatrix_raxml.nwk"), full.names = TRUE)
            ml_sm_file = file.path(ml_sm_path)
            ml_sm_tree = read.tree(ml_sm_file)
            ml_sm_tree = unroot(ml_sm_tree)

            # ML + ST
            ml_st_path = list.files(path = tree_dir, pattern = glob2rx("RAxML_bestTree*_paralog_tree.nwk"), full.names = TRUE)
            ml_st_file = file.path(ml_st_path)
            ml_st_tree = read.tree(ml_st_file)
            ml_st_tree = unroot(ml_st_tree)

            # check whether any of the 4 trees read above is non-binary (e.g. a consensus tree with polytomies)
            all_method_trees = c(me_sm_tree, me_st_tree, ml_sm_tree, ml_st_tree)
            non_binary = which(!is.binary(all_method_trees))
            if (length(non_binary) != 0) {

                # resolve each non-binary tree into a binary one with ape's multi2di() function
                for (tree in non_binary) {
                    all_method_trees[[tree]] = multi2di(all_method_trees[[tree]])
                }

                # copy the resolved trees back, otherwise the RF computation below would still use the unresolved ones
                me_sm_tree = all_method_trees[[1]]
                me_st_tree = all_method_trees[[2]]
                ml_sm_tree = all_method_trees[[3]]
                ml_st_tree = all_method_trees[[4]]
            }

            # read the reference tree associated with this family.
            ref_tree_file = file.path(paste0(strsplit(m, '_swap_')[[1]][1], '.nwk'))
            ref_tree      = read.tree(ref_tree_file)

            # Each gene family has a predefined gene tree (ref_tree), built by design by pasting a fixed species tree onto each leaf of the paralog tree. Gene families differ only in their paralog tree, which is unique to each family (15 tips).
            # Here we extract the paralog tree. Since every species has the same paralog tree, it is enough to extract the subtree containing all the sequences of one species, whichever it is.
            ref_subtree   = keep.tip(ref_tree, as.character(ref_codefile$V1))
    
            # we then rename the tips of the paralog tree (ref_subtree) to a more generic key name, effectively going from rat1 to seq1.
            for (j in 1:length(ref_subtree$tip.label)) { 
                ref_subtree$tip.label[j] = as.character(ref_codefile$V2[which(ref_codefile$V1 == ref_subtree[["tip.label"]][j])])
            }
    
            # Unroot the reference tree 
            ref_subtree = unroot(ref_subtree)

            # compute the RF distance matrix, all against all. 5 trees in total.
            all_trees        = c(ref_subtree,
                                 me_sm_tree,
                                 me_st_tree,
                                 ml_sm_tree,
                                 ml_st_tree
                                )
            class(all_trees) = "multiPhylo"
            all_vs_all_rf    = RF.dist(all_trees, normalize = TRUE)

            # update the family df with RF values. The first four columns together act as an ID for that tree.
            # all_vs_all_rf is a dist object: its first four elements are the distances of the reference (tree 1) to trees 2-5.
            df_all_rf = data.frame(
                        method      = c(df_all_rf$method, c("SM_ME", "ST_ME", "SM_ML", "ST_ML")),
                        group       = c(df_all_rf$group, rep(basename(run), each = 4)),
                        family      = c(df_all_rf$family, rep(strsplit(basename(m), '_swap_')[[1]][1], each = 4)),
                        combination = c(df_all_rf$combination, rep(strsplit(basename(m), '_swap_')[[1]][2], each = 4)),
                        rf_val      = c(df_all_rf$rf_val,  c(all_vs_all_rf[1], all_vs_all_rf[2], all_vs_all_rf[3], all_vs_all_rf[4]))
            )

        # every time a non-combination dir is encountered, write to file what has been stored up until now and reset the df
        } else {

            # append the table with the family RF values to the file. Writing once per family, rather than once per tree, is faster.
            write.table(df_all_rf, file = "tables/RF_swapped_set_all_methods_vs_reference.tsv", quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE, append = TRUE)

            # empty df
            df_all_rf = data.frame(
                method      = c(),
                group       = c(),
                family      = c(),
                combination = c(),
                rf_val      = c()
            )
        }
    }

    # appending the last family combinations of that group
    write.table(df_all_rf, file = "tables/RF_swapped_set_all_methods_vs_reference.tsv", quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE, append = TRUE)

} 

[1] "group1_23th"
[1] "/users/cn/avignoli/1_paralogs/simulated_datasets/simulated_sequences/rerun_25_species_corrected_branches/swapped_data/group1_23th/avg_branchlen_0.7_protein_family_001.sub"
[1] "/users/cn/avignoli/1_paralogs/simulated_datasets/simulated_sequences/rerun_25_species_corrected_branches/swapped_data/group1_23th/avg_branchlen_0.7_protein_family_011.sub"
[1] "/users/cn/avignoli/1_paralogs/simulated_datasets/simulated_sequences/rerun_25_species_corrected_branches/swapped_data/group1_23th/avg_branchlen_0.7_protein_family_021.sub"
[1] "/users/cn/avignoli/1_paralogs/simulated_datasets/simulated_sequences/rerun_25_species_corrected_branches/swapped_data/group1_23th/avg_branchlen_0.7_protein_family_031.sub"
[1] "/users/cn/avignoli/1_paralogs/simulated_datasets/simulated_sequences/rerun_25_species_corrected_branches/swapped_data/group1_23th/avg_branchlen_0.7_protein_family_041.sub"
[1] "/users/cn/avignoli/1_paralogs/simulated_datasets/simulated_sequences/rerun_25_species_correc