# Adjacency weights

In this experiment, we will use the following as input.
1. Species tree (generated by ZOMBI)
2. Gene order for extant genomes (generated by ZOMBI)
3. Gene content for ancestral genomes (obtained through ancestral gene orders generated by ZOMBI)
4. Reconciled gene trees for each gene family (generated by ZOMBI)

Here, we use DeCoSTAR and DeClone for sampling and obtaining adjacency weights. This experiment adds some noise to the reconstruction by using a larger set of potential adjacencies with non-uniform weights. However, the compared genomes will have exactly the same gene content as their ZOMBI-simulated counterparts.

In this experiment, the species tree topology is the same for all runs of the ILP.

### Experimental setup

#### Simulations

We use three sets of simulations:
1. Without loss and low rates of rearrangement (../../../sim/without_LT)
2. Without loss and higher rates of rearrangement (../../../sim/without_LT_high_rearr)
3. With gene loss as an event and low rates of rearrangement (../../../sim/with_L)

For each set of simulations, we have 10 extant genomes. The Root genome contains 100 gene families. Each set has 200 duplications over all branches of the species tree. The rates of rearrangement for sets 1 and 3 are 100 inversions and 100 translocations over all branches of the species tree. For set 3, the rate of gene loss is 100 genes over all branches. The exact parameters for the sets can be found under the directory "../../../code/ZOMBI_old" in the folders "no_loss_params", "no_loss_high_rearr_params" and "with_loss_params", respectively.

While generating the species tree, the option for species extinction was disabled. The option for horizontal gene transfer was also disabled for this experiment.

Note: The adjacencies have been obtained through two runs of DeCoSTAR. In some cases, DeCoSTAR produced output but with warnings about overflows in random backtracking. The details of the analysis of the DeCoSTAR output are provided in "../../../doc/DeCoSTAR_results_prelim_analysis.ipynb".

### ILP

The ILP is run with the linearization parameter $\alpha \in \{0, 0.25, 0.5, 0.75, 1\}$ for each of the 20 runs. For each combination (Run, $\alpha$), we compare the adjacency sets provided by the ILP to the true adjacencies from ZOMBI genomes. We generate precision-recall statistics and compare them for each $\alpha$ value. For one solution selected by the ILP, we also output the gene order for each species and the cuts and joins involved for the solution.
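
The scoring code itself is not shown in this notebook, so the following is only a minimal sketch of how precision, recall and F1 can be computed from two adjacency sets. It assumes that an adjacency is an unordered pair of gene extremities; the function name `adjacency_scores` and the extremity labels are hypothetical.

```python
def adjacency_scores(predicted, true):
    """Return (precision, recall, F1) for two collections of adjacencies."""
    # Store each adjacency as a frozenset so (a, b) and (b, a) compare equal.
    pred = {frozenset(adj) for adj in predicted}
    ref = {frozenset(adj) for adj in true}
    tp = len(pred & ref)  # adjacencies found by the ILP that are also true
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Toy example with hypothetical gene-extremity labels (not ZOMBI output).
true_adj = [("g1_h", "g2_t"), ("g2_h", "g3_t"), ("g3_h", "g4_t"), ("g4_h", "g5_t")]
ilp_adj = [("g2_t", "g1_h"), ("g2_h", "g3_t"), ("g3_h", "g5_t")]
precision, recall, f1 = adjacency_scores(ilp_adj, true_adj)
print(precision, recall, f1)
```

In the toy example two of the three predicted adjacencies are true, so the precision is 2/3, the recall is 2/4 and the F1 score is 4/7.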

In [1]:
from IPython.display import Image
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

In [2]:
import os
from collections import defaultdict
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
plt.switch_backend('agg')
%matplotlib inline

In [3]:
with_L = "../output/with_L"
no_LT = "../output/without_LT"
no_LT_high_rearr = "../output/without_LT_high_rearr"

prec, rec, F1 = {}, {}, {}
dist, cuts, joins , dups = {}, {}, {}, {}
b_dist, b_cuts, b_joins , b_dups = {}, {}, {}, {}

In [4]:
def update_dict(line, stat_dict, alpha):
    stat = line.split(" ")[-1]
    stat_dict[alpha].append(float(stat))
    return stat_dict

def append_dist(line, d1, d2, d3, d4, alpha):
    s1, s2, s3 = line.split("\t")[1], line.split("\t")[2], line.split("\t")[3]
    d1[alpha].append(float(s1))
    d2[alpha].append(float(s2))
    d3[alpha].append(float(s3))
    # Duplications: half of what remains after removing cuts and joins from the distance.
    d4[alpha].append((float(s1) - (float(s2)+float(s3)))/2)
    return d1, d2, d3, d4

def append_b_dist(line, d1, d2, d3, d4, alpha):
    b = line.split("\t")[0]
    #print(b)
    s1, s2, s3 = line.split("\t")[1], line.split("\t")[2], line.split("\t")[3]

    if b not in d1[alpha]:
        # Start an empty list per statistic the first time branch b is seen.
        d1[alpha][b], d2[alpha][b], d3[alpha][b], d4[alpha][b] = [], [], [], []
    d1[alpha][b].append(float(s1))
    d2[alpha][b].append(float(s2))
    d3[alpha][b].append(float(s3))
    d4[alpha][b].append((float(s1) - (float(s2)+float(s3)))/2)
    return d1, d2, d3, d4
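
The per-branch helper `append_b_dist` recovers the number of duplications from the other three columns. The sketch below spells out that arithmetic, assuming the SCJTDFD distance decomposes as cuts + joins + 2 × duplications. This decomposition is inferred from the helper and from the branch tables in this notebook, not stated by the ILP code; the function name `duplications_from_distance` is hypothetical.

```python
def duplications_from_distance(dist, cuts, joins):
    """Infer the duplication count, assuming dist = cuts + joins + 2 * dups."""
    # The whole remainder is halved, not only the sum of cuts and joins.
    return (dist - cuts - joins) / 2


# Mean values for branch (Root,n1) at alpha = 0 in the without_LT runs.
dups = duplications_from_distance(24.15, 4.0625, 13.6625)
print(round(dups, 4))  # 3.2125
```

The result matches the Dups column reported for that branch, which supports the assumed decomposition.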


## Without loss and transfer

The following table lists the average precision, recall and F1 score over 20 runs. Precision is highest for low values of $\alpha$ and stays close to 1 throughout (mean above 0.996 for every $\alpha$), whereas recall decreases as $\alpha$ decreases. Except for $\alpha = 0$, the recall is consistently above 0.86 and the F1 score above 0.92. In each run, the F1 scores for $\alpha = 1$ are marginally better than those for $\alpha=0.75$.

In [5]:
#no_LT
prec['no_LT'], rec['no_LT'], F1['no_LT'] = defaultdict(list), defaultdict(list), defaultdict(list)
dist['no_LT'], cuts['no_LT'], joins['no_LT'], dups['no_LT'] = defaultdict(list), defaultdict(list), defaultdict(list), defaultdict(list)
b_dist['no_LT'], b_cuts['no_LT'], b_joins['no_LT'], b_dups['no_LT'] = defaultdict(dict), defaultdict(dict), defaultdict(dict), defaultdict(dict)

files = [os.path.join(dp, f) for dp, dn, filenames in os.walk(no_LT) for f in filenames if "stats" in f]
for file in files:
    run = file.split("/")[3].split("_")[1]
    alpha = file.split("/")[3].split("_")[-1]
    with open(file, 'r') as f:
        for line in f:
            if "Precision" in line:
                prec['no_LT'] = update_dict(line, prec['no_LT'], alpha)
            elif "Recall" in line:
                rec['no_LT'] = update_dict(line, rec['no_LT'], alpha)
            elif "F1_score:" in line:
                F1['no_LT'] = update_dict(line, F1['no_LT'], alpha)
            elif "(" in line and "None" not in line:
                if len(line.split("\t")) > 3:
                    b_dist['no_LT'], b_cuts['no_LT'], b_joins['no_LT'], b_dups['no_LT'] = append_b_dist(line, b_dist['no_LT'], b_cuts['no_LT'], b_joins['no_LT'], b_dups['no_LT'], alpha)

            elif "Overall" in line:
                dist['no_LT'], cuts['no_LT'], joins['no_LT'], dups['no_LT'] = append_dist(line, dist['no_LT'], cuts['no_LT'], joins['no_LT'], dups['no_LT'], alpha)
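
The same parsing loop is repeated for each simulation set. As a rough sketch, the summary part of it could be wrapped in one helper that takes the output directory as an argument. The helper name `load_summary_stats`, the directory name `run_1_alpha_0.25` and the file name `stats.txt` are hypothetical; the sketch also assumes that the alpha value is the last underscore-separated token of the directory that holds the stats file, and it covers only the precision, recall and F1 lines.

```python
import os
import tempfile
from collections import defaultdict


def load_summary_stats(root):
    """Collect precision, recall and F1 values per alpha under `root`."""
    stats = {name: defaultdict(list) for name in ("precision", "recall", "f1")}
    markers = {"Precision": "precision", "Recall": "recall", "F1_score:": "f1"}
    for dirpath, _dirs, filenames in os.walk(root):
        for fname in filenames:
            if "stats" not in fname:
                continue
            # Assumption: the run directory name ends in "_<alpha>".
            alpha = os.path.basename(dirpath).split("_")[-1]
            with open(os.path.join(dirpath, fname)) as handle:
                for line in handle:
                    for marker, name in markers.items():
                        if marker in line:
                            # The value is the last whitespace-separated token.
                            stats[name][alpha].append(float(line.split()[-1]))
                            break
    return stats


# Demonstration on a temporary directory with a made-up stats file.
with tempfile.TemporaryDirectory() as root:
    run_dir = os.path.join(root, "run_1_alpha_0.25")
    os.makedirs(run_dir)
    with open(os.path.join(run_dir, "stats.txt"), "w") as handle:
        handle.write("Precision: 0.99\nRecall: 0.93\nF1_score: 0.96\n")
    demo = load_summary_stats(root)
print(dict(demo["recall"]))  # {'0.25': [0.93]}
```

Taking the alpha value from the directory name with `os.path.basename` avoids depending on a fixed position in the path, at the cost of assuming that the stats file sits directly inside the run directory.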

In [6]:
d_per_branch = {}
dist_scores = {}
d_per_branch['no_LT'] = {}
dist_scores['no_LT'] = {}
for alpha in b_dist['no_LT']:
    d_per_branch['no_LT'][alpha] = {}
    dist_scores['no_LT'][alpha] = []
    for branch in b_dist['no_LT'][alpha]:
        #print(branch)
        d_per_branch['no_LT'][alpha][branch] = {}
        d_per_branch['no_LT'][alpha][branch]['dist'] = sum(b_dist['no_LT'][alpha][branch])/len(b_dist['no_LT'][alpha][branch])
        d_per_branch['no_LT'][alpha][branch]['cuts'] = sum(b_cuts['no_LT'][alpha][branch])/len(b_dist['no_LT'][alpha][branch])
        d_per_branch['no_LT'][alpha][branch]['joins'] = sum(b_joins['no_LT'][alpha][branch])/len(b_dist['no_LT'][alpha][branch])
        d_per_branch['no_LT'][alpha][branch]['dups'] = sum(b_dups['no_LT'][alpha][branch])/len(b_dist['no_LT'][alpha][branch])
        dist_scores['no_LT'][alpha].append([branch, d_per_branch['no_LT'][alpha][branch]['dist'], d_per_branch['no_LT'][alpha][branch]['cuts'], d_per_branch['no_LT'][alpha][branch]['joins'], d_per_branch['no_LT'][alpha][branch]['dups']])
    dist_scores['no_LT'][alpha] = pd.DataFrame(dist_scores['no_LT'][alpha]) 
    dist_scores['no_LT'][alpha].rename(columns = {0: 'branch', 1: 'SCJTDFD', 2: 'Cuts', 3: 'Joins', 4: 'Dups'}, inplace = True) 

In [7]:
mean = {}
mean_scores = {}
mean['no_LT'] = {}
mean_scores['no_LT'] = []
for alpha in prec['no_LT']:
    mean['no_LT'][alpha] = {}
    mean['no_LT'][alpha]['precision'] = sum(prec['no_LT'][alpha])/len(prec['no_LT'][alpha])
    mean['no_LT'][alpha]['recall'] = sum(rec['no_LT'][alpha])/len(rec['no_LT'][alpha])
    mean['no_LT'][alpha]['f1_score'] = sum(F1['no_LT'][alpha])/len(F1['no_LT'][alpha])
    mean_scores['no_LT'].append([alpha, mean['no_LT'][alpha]['precision'], mean['no_LT'][alpha]['recall'], mean['no_LT'][alpha]['f1_score']])

mean_scores['no_LT'] = pd.DataFrame(mean_scores['no_LT'])
mean_scores['no_LT'].rename(columns = {0: 'alpha', 1: 'Precision', 2: 'Recall', 3: 'F1 score'}, inplace = True) 
mean_scores['no_LT'] = mean_scores['no_LT'].sort_values(by=['alpha'])
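
The alpha keys are strings taken from file names, so `sort_values` orders them as text. That happens to agree with numeric order for the five values used here, but it would not in general (for example, "10" sorts before "2"). A possible alternative, sketched below with made-up recall values, converts alpha to a float and lets `groupby` compute the means; the variable names are hypothetical.

```python
import pandas as pd

# Made-up per-run recall values keyed by alpha strings (illustration only).
recall_by_alpha = {"0.5": [0.94, 0.95], "0": [0.84, 0.85], "1": [0.95, 0.96]}

# One row per (alpha, run) pair, with alpha stored as a number.
records = [
    {"alpha": float(alpha), "Recall": value}
    for alpha, values in recall_by_alpha.items()
    for value in values
]
summary = (
    pd.DataFrame(records)
    .groupby("alpha", as_index=False)
    .mean()
    .sort_values("alpha")
)
print(summary)
```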

### Mean statistics

In [8]:
mean_scores['no_LT']

Unnamed: 0,alpha,Precision,Recall,F1 score
1,0.0,0.999554,0.84588,0.916172
4,0.25,0.998721,0.9278,0.961908
0,0.5,0.998665,0.944252,0.970673
2,0.75,0.997941,0.951681,0.974228
3,1.0,0.996579,0.955503,0.97555


### Branch wise distance for $\alpha=0$ for without_LT runs

In [9]:
dist_scores['no_LT'][str(0)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
17,"(Root,n1)",24.15,4.0625,13.6625,3.2125
15,"(Root,n2)",26.4125,5.325,15.7375,2.675
5,"(n1,n5)",52.6375,8.675,35.3625,4.3
11,"(n1,n6)",16.2375,4.7625,11.475,0.0
8,"(n10,n13)",64.525,10.525,44.0,5.0
2,"(n10,n14)",67.475,12.725,45.65,4.55
3,"(n2,n3)",26.5375,9.7625,14.275,1.25
7,"(n2,n4)",148.0375,20.4125,98.125,14.75
16,"(n3,n10)",84.9625,15.95,52.8875,8.0625
10,"(n3,n9)",128.425,18.8,86.075,11.775


### Branch wise distance for $\alpha=0.25$ for without_LT runs

In [10]:
dist_scores['no_LT'][str(0.25)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
17,"(Root,n1)",30.3625,2.3875,21.8,3.0875
15,"(Root,n2)",28.025,3.9375,18.9875,2.55
5,"(n1,n5)",48.775,16.0,24.175,4.3
11,"(n1,n6)",10.15,7.0375,3.1125,0.0
8,"(n10,n13)",64.2625,14.425,39.8375,5.0
2,"(n10,n14)",67.7375,16.8875,41.75,4.55
3,"(n2,n3)",21.2375,7.6,11.1875,1.225
7,"(n2,n4)",155.625,37.75,88.375,14.75
16,"(n3,n10)",83.1375,24.0875,42.925,8.0625
10,"(n3,n9)",131.6625,32.975,75.1375,11.775


### Branch wise distance for $\alpha=0.5$ for without_LT runs

In [11]:
dist_scores['no_LT'][str(0.5)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
17,"(Root,n1)",34.0125,2.4875,25.35,3.0875
15,"(Root,n2)",27.6625,2.8875,19.7,2.5375
5,"(n1,n5)",50.7875,18.375,23.7625,4.325
11,"(n1,n6)",10.325,7.8875,2.4375,0.0
8,"(n10,n13)",64.325,15.575,38.75,5.0
2,"(n10,n14)",67.675,17.975,40.6,4.55
3,"(n2,n3)",20.5875,8.2,9.9375,1.225
7,"(n2,n4)",158.4875,41.7,87.2875,14.75
16,"(n3,n10)",83.5375,24.7625,42.65,8.0625
10,"(n3,n9)",133.7875,35.5125,74.725,11.775


### Branch wise distance for $\alpha=0.75$ for without_LT runs

In [12]:
dist_scores['no_LT'][str(0.75)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
17,"(Root,n1)",35.1125,2.975,25.9625,3.0875
15,"(Root,n2)",28.5375,3.0125,20.475,2.525
5,"(n1,n5)",51.9,18.8375,24.4125,4.325
11,"(n1,n6)",8.7125,6.4125,2.3,0.0
8,"(n10,n13)",64.9,16.8125,38.0875,5.0
2,"(n10,n14)",67.825,19.0,39.725,4.55
3,"(n2,n3)",18.6625,6.4125,9.75,1.25
7,"(n2,n4)",159.5875,43.125,86.9625,14.75
16,"(n3,n10)",87.0125,27.225,43.6625,8.0625
10,"(n3,n9)",135.925,37.8375,74.4875,11.8


### Branch wise distance for $\alpha=1$ for without_LT runs

In [13]:
dist_scores['no_LT'][str(1)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
17,"(Root,n1)",379.0625,346.7,26.1375,3.1125
15,"(Root,n2)",368.1,342.1,20.825,2.5875
5,"(n1,n5)",417.7875,384.5625,24.575,4.325
11,"(n1,n6)",368.9125,366.525,2.3875,0.0
8,"(n10,n13)",495.9875,448.025,37.9625,5.0
2,"(n10,n14)",499.0125,450.15,39.7625,4.55
3,"(n2,n3)",377.6875,365.525,9.6625,1.25
7,"(n2,n4)",544.4375,428.0625,86.875,14.75
16,"(n3,n10)",463.4375,403.0875,44.1,8.125
10,"(n3,n9)",527.9125,429.5125,74.475,11.9625


## With loss

When gene loss is allowed as an event, the recall drops significantly compared to the previous case: the mean recall lies between 0.46 and 0.72, against 0.85 to 0.96 without loss. The mean precision, however, is still above 96%. The mean F1 score rises with $\alpha$ up to $\alpha=0.75$; at $\alpha=1$ it is marginally lower (0.8234 versus 0.8247).

In [14]:
#with_L
prec['with_L'], rec['with_L'], F1['with_L'] = defaultdict(list), defaultdict(list), defaultdict(list)
dist['with_L'], cuts['with_L'], joins['with_L'], dups['with_L'] = defaultdict(list), defaultdict(list), defaultdict(list), defaultdict(list)
b_dist['with_L'], b_cuts['with_L'], b_joins['with_L'], b_dups['with_L'] = defaultdict(dict), defaultdict(dict), defaultdict(dict), defaultdict(dict)

files = [os.path.join(dp, f) for dp, dn, filenames in os.walk(with_L) for f in filenames if "stats" in f]
for file in files:
    run = file.split("/")[3].split("_")[1]
    alpha = file.split("/")[3].split("_")[-1]
    with open(file, 'r') as f:
        for line in f:
            if "Precision" in line:
                prec['with_L'] = update_dict(line, prec['with_L'], alpha)
            elif "Recall" in line:
                rec['with_L'] = update_dict(line, rec['with_L'], alpha)
            elif "F1_score:" in line:
                F1['with_L'] = update_dict(line, F1['with_L'], alpha)
            elif "(" in line and "None" not in line:
                if len(line.split("\t")) > 3:
                    b_dist['with_L'], b_cuts['with_L'], b_joins['with_L'], b_dups['with_L'] = append_b_dist(line, b_dist['with_L'], b_cuts['with_L'], b_joins['with_L'], b_dups['with_L'], alpha)

            elif "Overall" in line:
                dist['with_L'], cuts['with_L'], joins['with_L'], dups['with_L'] = append_dist(line, dist['with_L'], cuts['with_L'], joins['with_L'], dups['with_L'], alpha)

In [15]:
d_per_branch = {}
dist_scores = {}
d_per_branch['with_L'] = {}
dist_scores['with_L'] = {}
for alpha in b_dist['with_L']:
    d_per_branch['with_L'][alpha] = {}
    dist_scores['with_L'][alpha] = []
    for branch in b_dist['with_L'][alpha]:
        #print(branch)
        d_per_branch['with_L'][alpha][branch] = {}
        d_per_branch['with_L'][alpha][branch]['dist'] = sum(b_dist['with_L'][alpha][branch])/len(b_dist['with_L'][alpha][branch])
        d_per_branch['with_L'][alpha][branch]['cuts'] = sum(b_cuts['with_L'][alpha][branch])/len(b_dist['with_L'][alpha][branch])
        d_per_branch['with_L'][alpha][branch]['joins'] = sum(b_joins['with_L'][alpha][branch])/len(b_dist['with_L'][alpha][branch])
        d_per_branch['with_L'][alpha][branch]['dups'] = sum(b_dups['with_L'][alpha][branch])/len(b_dist['with_L'][alpha][branch])
        dist_scores['with_L'][alpha].append([branch, d_per_branch['with_L'][alpha][branch]['dist'], d_per_branch['with_L'][alpha][branch]['cuts'], d_per_branch['with_L'][alpha][branch]['joins'], d_per_branch['with_L'][alpha][branch]['dups']])
    dist_scores['with_L'][alpha] = pd.DataFrame(dist_scores['with_L'][alpha]) 
    dist_scores['with_L'][alpha].rename(columns = {0: 'branch', 1: 'SCJTDFD', 2: 'Cuts', 3: 'Joins', 4: 'Dups'}, inplace = True) 

In [16]:
mean = {}
mean_scores = {}
mean['with_L'] = {}
mean_scores['with_L'] = []
for alpha in prec['with_L']:
    mean['with_L'][alpha] = {}
    mean['with_L'][alpha]['precision'] = sum(prec['with_L'][alpha])/len(prec['with_L'][alpha])
    mean['with_L'][alpha]['recall'] = sum(rec['with_L'][alpha])/len(rec['with_L'][alpha])
    mean['with_L'][alpha]['f1_score'] = sum(F1['with_L'][alpha])/len(F1['with_L'][alpha])
    mean_scores['with_L'].append([alpha, mean['with_L'][alpha]['precision'], mean['with_L'][alpha]['recall'], mean['with_L'][alpha]['f1_score']])

mean_scores['with_L'] = pd.DataFrame(mean_scores['with_L'])
mean_scores['with_L'].rename(columns = {0: 'alpha', 1: 'Precision', 2: 'Recall', 3: 'F1 score'}, inplace = True)    

### Mean statistics

In [17]:
mean_scores['with_L'].sort_values(by=['alpha'])

Unnamed: 0,alpha,Precision,Recall,F1 score
1,0.0,0.993623,0.458245,0.625277
4,0.25,0.990968,0.609658,0.753615
0,0.5,0.986962,0.679594,0.804292
2,0.75,0.975972,0.714733,0.824674
3,1.0,0.965551,0.718503,0.823363


### Branch wise distance for $\alpha=0$ for with_L runs

In [18]:
dist_scores['with_L'][str(0)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
15,"(Root,n1)",135.925,0.0,93.725,21.1
12,"(Root,n2)",29.35,8.9125,5.9875,7.225
11,"(n10,n11)",91.4875,7.9875,64.8,9.35
10,"(n10,n12)",59.325,6.2375,40.9375,6.075
14,"(n12,n17)",32.4375,3.7875,25.45,1.6
17,"(n12,n18)",37.0125,5.0,28.3125,1.85
3,"(n2,n3)",88.825,3.95,51.875,16.5
9,"(n2,n4)",48.2625,4.0375,18.575,12.825
2,"(n3,n13)",60.2375,2.275,48.7625,4.6
8,"(n3,n14)",65.7625,2.1375,51.525,6.05


### Branch wise distance for $\alpha=0.25$ for with_L runs

In [19]:
dist_scores['with_L'][str(0.25)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
15,"(Root,n1)",136.1375,0.55,93.3875,21.1
12,"(Root,n2)",33.1125,4.05,15.2125,6.925
11,"(n10,n11)",91.575,20.7375,52.1375,9.35
10,"(n10,n12)",58.4625,14.15,32.1625,6.075
14,"(n12,n17)",33.0375,8.45,21.3875,1.6
17,"(n12,n18)",36.4125,9.0625,23.65,1.85
3,"(n2,n3)",86.9125,7.4125,46.5,16.5
9,"(n2,n4)",62.5,4.9625,33.4875,12.025
2,"(n3,n13)",60.2125,5.3875,45.625,4.6
8,"(n3,n14)",65.7875,5.275,48.4125,6.05


### Branch wise distance for $\alpha=0.5$ for with_L runs

In [20]:
dist_scores['with_L'][str(0.5)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
15,"(Root,n1)",136.775,1.1625,93.4125,21.1
12,"(Root,n2)",36.0125,2.375,19.7875,6.925
11,"(n10,n11)",95.425,27.8,48.925,9.35
10,"(n10,n12)",57.9,17.375,28.375,6.075
14,"(n12,n17)",33.025,10.075,19.75,1.6
17,"(n12,n18)",36.425,10.7,22.025,1.85
3,"(n2,n3)",87.525,9.35,45.075,16.55
9,"(n2,n4)",68.7375,4.925,39.9875,11.9125
2,"(n3,n13)",60.2625,7.15,43.9125,4.6
8,"(n3,n14)",65.7375,6.9875,46.65,6.05


### Branch wise distance for $\alpha=0.75$ for with_L runs

In [21]:
dist_scores['with_L'][str(0.75)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
15,"(Root,n1)",137.3,1.65,93.45,21.1
12,"(Root,n2)",39.3125,1.5375,24.025,6.875
11,"(n10,n11)",101.7,34.775,48.225,9.35
10,"(n10,n12)",63.275,22.3125,28.8125,6.075
14,"(n12,n17)",34.15,12.225,18.725,1.6
17,"(n12,n18)",37.775,12.9625,21.1125,1.85
3,"(n2,n3)",91.4625,11.575,46.8375,16.525
9,"(n2,n4)",72.45,6.1375,42.6875,11.8125
2,"(n3,n13)",61.15,10.125,41.825,4.6
8,"(n3,n14)",67.0,10.15,44.75,6.05


### Branch wise distance for $\alpha=1$ for with_L runs

In [22]:
dist_scores['with_L'][str(1)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
15,"(Root,n1)",249.625,113.9,93.525,21.1
12,"(Root,n2)",99.2375,59.5375,25.75,6.975
11,"(n10,n11)",281.5875,214.5,48.3875,9.35
10,"(n10,n12)",240.85,199.3,29.35,6.1
14,"(n12,n17)",233.275,211.3125,18.7625,1.6
17,"(n12,n18)",237.975,213.25,21.025,1.85
3,"(n2,n3)",188.3,107.65,47.6,16.525
9,"(n2,n4)",176.225,108.95,43.45,11.9125
2,"(n3,n13)",218.075,167.1625,41.7125,4.6
8,"(n3,n14)",226.8375,169.9875,44.75,6.05


## Without loss but with higher rate of rearrangement

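This setting shows by far the lowest recall: the mean recall stays below 0.39 for every value of $\alpha$. The precision also drops more sharply than in the other two settings, from 0.996 at $\alpha=0$ to 0.881 at $\alpha=1$.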

In [23]:
#no_LT_high_rearr
prec['no_LT_high_rearr'], rec['no_LT_high_rearr'], F1['no_LT_high_rearr'] = defaultdict(list), defaultdict(list), defaultdict(list)
dist['no_LT_high_rearr'], cuts['no_LT_high_rearr'], joins['no_LT_high_rearr'], dups['no_LT_high_rearr'] = defaultdict(list), defaultdict(list), defaultdict(list), defaultdict(list)
b_dist['no_LT_high_rearr'], b_cuts['no_LT_high_rearr'], b_joins['no_LT_high_rearr'], b_dups['no_LT_high_rearr'] = defaultdict(dict), defaultdict(dict), defaultdict(dict), defaultdict(dict)

files = [os.path.join(dp, f) for dp, dn, filenames in os.walk(no_LT_high_rearr) for f in filenames if "stats" in f]
for file in files:
    run = file.split("/")[3].split("_")[1]
    alpha = file.split("/")[3].split("_")[-1]
    with open(file, 'r') as f:
        for line in f:
            if "Precision" in line:
                prec['no_LT_high_rearr'] = update_dict(line, prec['no_LT_high_rearr'], alpha)
            elif "Recall" in line:
                rec['no_LT_high_rearr'] = update_dict(line, rec['no_LT_high_rearr'], alpha)
            elif "F1_score:" in line:
                F1['no_LT_high_rearr'] = update_dict(line, F1['no_LT_high_rearr'], alpha)
            elif "(" in line and "None" not in line:
                if len(line.split("\t")) > 3:
                    b_dist['no_LT_high_rearr'], b_cuts['no_LT_high_rearr'], b_joins['no_LT_high_rearr'], b_dups['no_LT_high_rearr'] = append_b_dist(line, b_dist['no_LT_high_rearr'], b_cuts['no_LT_high_rearr'], b_joins['no_LT_high_rearr'], b_dups['no_LT_high_rearr'], alpha)

            elif "Overall" in line:
                dist['no_LT_high_rearr'], cuts['no_LT_high_rearr'], joins['no_LT_high_rearr'], dups['no_LT_high_rearr'] = append_dist(line, dist['no_LT_high_rearr'], cuts['no_LT_high_rearr'], joins['no_LT_high_rearr'], dups['no_LT_high_rearr'], alpha)

In [24]:
d_per_branch = {}
dist_scores = {}
d_per_branch['no_LT_high_rearr'] = {}
dist_scores['no_LT_high_rearr'] = {}
for alpha in b_dist['no_LT_high_rearr']:
    d_per_branch['no_LT_high_rearr'][alpha] = {}
    dist_scores['no_LT_high_rearr'][alpha] = []
    for branch in b_dist['no_LT_high_rearr'][alpha]:
        #print(branch)
        d_per_branch['no_LT_high_rearr'][alpha][branch] = {}
        d_per_branch['no_LT_high_rearr'][alpha][branch]['dist'] = sum(b_dist['no_LT_high_rearr'][alpha][branch])/len(b_dist['no_LT_high_rearr'][alpha][branch])
        d_per_branch['no_LT_high_rearr'][alpha][branch]['cuts'] = sum(b_cuts['no_LT_high_rearr'][alpha][branch])/len(b_dist['no_LT_high_rearr'][alpha][branch])
        d_per_branch['no_LT_high_rearr'][alpha][branch]['joins'] = sum(b_joins['no_LT_high_rearr'][alpha][branch])/len(b_dist['no_LT_high_rearr'][alpha][branch])
        d_per_branch['no_LT_high_rearr'][alpha][branch]['dups'] = sum(b_dups['no_LT_high_rearr'][alpha][branch])/len(b_dist['no_LT_high_rearr'][alpha][branch])
        dist_scores['no_LT_high_rearr'][alpha].append([branch, d_per_branch['no_LT_high_rearr'][alpha][branch]['dist'], d_per_branch['no_LT_high_rearr'][alpha][branch]['cuts'], d_per_branch['no_LT_high_rearr'][alpha][branch]['joins'], d_per_branch['no_LT_high_rearr'][alpha][branch]['dups']])
    dist_scores['no_LT_high_rearr'][alpha] = pd.DataFrame(dist_scores['no_LT_high_rearr'][alpha]) 
    dist_scores['no_LT_high_rearr'][alpha].rename(columns = {0: 'branch', 1: 'SCJTDFD', 2: 'Cuts', 3: 'Joins', 4: 'Dups'}, inplace = True) 

In [25]:
mean = {}
mean_scores = {}
mean['no_LT_high_rearr'] = {}
mean_scores['no_LT_high_rearr'] = []
for alpha in prec['no_LT_high_rearr']:
    mean['no_LT_high_rearr'][alpha] = {}
    mean['no_LT_high_rearr'][alpha]['precision'] = sum(prec['no_LT_high_rearr'][alpha])/len(prec['no_LT_high_rearr'][alpha])
    mean['no_LT_high_rearr'][alpha]['recall'] = sum(rec['no_LT_high_rearr'][alpha])/len(rec['no_LT_high_rearr'][alpha])
    mean['no_LT_high_rearr'][alpha]['f1_score'] = sum(F1['no_LT_high_rearr'][alpha])/len(F1['no_LT_high_rearr'][alpha])
    mean_scores['no_LT_high_rearr'].append([alpha, mean['no_LT_high_rearr'][alpha]['precision'], mean['no_LT_high_rearr'][alpha]['recall'], mean['no_LT_high_rearr'][alpha]['f1_score']])

mean_scores['no_LT_high_rearr'] = pd.DataFrame(mean_scores['no_LT_high_rearr'])
mean_scores['no_LT_high_rearr'].rename(columns = {0: 'alpha', 1: 'Precision', 2: 'Recall', 3: 'F1 score'}, inplace = True)    

### Mean statistics

In [26]:
mean_scores['no_LT_high_rearr'].sort_values(by=['alpha'])

Unnamed: 0,alpha,Precision,Recall,F1 score
1,0.0,0.995853,0.280073,0.436931
4,0.25,0.985612,0.326363,0.490057
0,0.5,0.950925,0.357817,0.519523
2,0.75,0.917814,0.375727,0.532075
3,1.0,0.880963,0.380519,0.529139
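
As a quick cross-check of the three "Mean statistics" tables, the following sketch picks, for each simulation set, the value of $\alpha$ with the highest mean F1 score. The numbers are copied from the tables in this notebook; the variable names are only illustrative.

```python
# Mean F1 scores copied from the three "Mean statistics" tables.
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
mean_f1 = {
    "without_LT": [0.916172, 0.961908, 0.970673, 0.974228, 0.97555],
    "with_L": [0.625277, 0.753615, 0.804292, 0.824674, 0.823363],
    "without_LT_high_rearr": [0.436931, 0.490057, 0.519523, 0.532075, 0.529139],
}

# For each set, look up the alpha at the position of the largest mean F1.
best_alpha = {
    name: alphas[scores.index(max(scores))] for name, scores in mean_f1.items()
}
print(best_alpha)
# {'without_LT': 1.0, 'with_L': 0.75, 'without_LT_high_rearr': 0.75}
```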


### Branch wise distance for $\alpha=0$ for without_LT_high_rearr runs

In [27]:
dist_scores['no_LT_high_rearr'][str(0)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
3,"(Root,n1)",46.0875,0.0,15.7375,15.175
12,"(Root,n2)",5.3,1.95,0.35,1.5
14,"(n1,n11)",112.9,0.6375,96.6625,7.8
9,"(n1,n12)",109.225,0.4375,94.6125,7.0875
6,"(n10,n13)",93.35,0.675,87.175,2.75
1,"(n10,n14)",92.5,0.625,86.875,2.5
13,"(n11,n15)",28.4625,0.2125,27.05,0.6
15,"(n11,n16)",29.6375,0.25,27.3875,1.0
10,"(n12,n17)",31.0875,0.35,29.2375,0.75
17,"(n12,n18)",30.7125,0.2625,28.95,0.75


### Branch wise distance for $\alpha=0.25$ for without_LT_high_rearr runs

In [28]:
dist_scores['no_LT_high_rearr'][str(0.25)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
3,"(Root,n1)",53.0625,0.0875,23.025,14.975
12,"(Root,n2)",6.05,1.275,1.825,1.475
14,"(n1,n11)",109.0375,1.725,91.7125,7.8
9,"(n1,n12)",105.9875,1.575,90.2375,7.0875
6,"(n10,n13)",93.55,4.35,83.7,2.75
1,"(n10,n14)",92.3,4.1,83.2,2.5
13,"(n11,n15)",28.6125,2.2625,25.15,0.6
15,"(n11,n16)",29.2875,2.05,25.2375,1.0
10,"(n12,n17)",31.0125,2.575,26.9375,0.75
17,"(n12,n18)",30.6875,2.5125,26.675,0.75


### Branch wise distance for $\alpha=0.5$ for without_LT_high_rearr runs

In [29]:
dist_scores['no_LT_high_rearr'][str(0.5)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
3,"(Root,n1)",54.8875,0.25,24.6875,14.975
12,"(Root,n2)",10.9875,1.2125,6.85,1.4625
14,"(n1,n11)",109.1375,5.0875,88.45,7.8
9,"(n1,n12)",107.025,5.275,87.575,7.0875
6,"(n10,n13)",93.4375,5.3625,82.575,2.75
1,"(n10,n14)",92.4125,5.225,82.1875,2.5
13,"(n11,n15)",28.675,3.0,24.475,0.6
15,"(n11,n16)",29.225,2.725,24.5,1.0
10,"(n12,n17)",31.2375,3.525,26.2125,0.75
17,"(n12,n18)",30.4625,3.2375,25.725,0.75


### Branch wise distance for $\alpha=0.75$ for without_LT_high_rearr runs

In [30]:
dist_scores['no_LT_high_rearr'][str(0.75)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
3,"(Root,n1)",57.5125,1.4,26.0875,15.0125
12,"(Root,n2)",14.925,2.25,9.825,1.425
14,"(n1,n11)",110.1,6.6875,87.9125,7.75
9,"(n1,n12)",109.5375,7.5,87.8125,7.1125
6,"(n10,n13)",93.35,7.675,80.175,2.75
1,"(n10,n14)",92.675,7.7125,79.9625,2.5
13,"(n11,n15)",28.9,3.975,23.725,0.6
15,"(n11,n16)",29.4,3.675,23.725,1.0
10,"(n12,n17)",31.3625,4.525,25.3375,0.75
17,"(n12,n18)",30.4875,4.1875,24.8,0.75


### Branch wise distance for $\alpha=1$ for without_LT_high_rearr runs

In [31]:
dist_scores['no_LT_high_rearr'][str(1)].sort_values(by=['branch'])

Unnamed: 0,branch,SCJTDFD,Cuts,Joins,Dups
3,"(Root,n1)",108.2,50.5125,27.5375,15.075
12,"(Root,n2)",53.2125,38.75,11.5625,1.45
14,"(n1,n11)",255.4375,151.7625,88.175,7.75
9,"(n1,n12)",252.9625,150.8625,87.8,7.15
6,"(n10,n13)",274.0,188.4375,80.0625,2.75
1,"(n10,n14)",272.9125,188.2,79.7125,2.5
13,"(n11,n15)",206.55,181.6125,23.7375,0.6
15,"(n11,n16)",206.925,181.2125,23.7125,1.0
10,"(n12,n17)",211.3375,184.625,25.2125,0.75
17,"(n12,n18)",210.35,184.1,24.75,0.75
