# Experiment 1a
### Cedric Chauve, 11/12/2018

## Introduction

In this experiment (script *exp1a.sh*) we counted the number of histories for the following data:
- species tree size (number of leaves) from 3 to 32,
- for each species tree size, we considered 100 trees 
    - the first one (index 0) is the caterpillar,
    - if the species tree size is a power of 2, the second tree (index 1) is the complete binary tree,
    - the remaining trees are random,
- the history size (number of leaves) ranges from 1 to 50,
- for each species tree, we considered 25 random rankings.
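
As a minimal sketch of the rule above, the species tree sizes that receive a complete binary tree at index 1 can be listed with a power-of-2 test. The helper name `is_power_of_two` and the variable `pow2_sizes` are illustrative and do not come from *exp1a.sh*.

```python
def is_power_of_two(k: int) -> bool:
    """Return True when k is a positive power of 2 (1, 2, 4, 8, ...)."""
    return k > 0 and (k & (k - 1)) == 0


# Size range taken from the description above (3 to 32 leaves).
S_SIZE_MIN, S_SIZE_MAX = 3, 32

pow2_sizes = [k for k in range(S_SIZE_MIN, S_SIZE_MAX + 1) if is_power_of_two(k)]
print(pow2_sizes)  # [4, 8, 16, 32]
```

Deriving this list from the size range, rather than hardcoding it, keeps it in sync when the range of sizes being analysed changes.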

We record the results for species trees of a given size *k* in the file *results/exp1a_k*. Each non-comment row of the result file has the following tab-separated format:
- species tree size
- species tree index
- ranking type (U for unranked, R for ranked)
- evolutionary model (DL or DLT)
- if unranked, Newick string describing the tree, otherwise ranking of internal nodes
- numbers of histories, one per history size in increasing order, separated by spaces.
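
A row in this format could be parsed as in the sketch below. It assumes the six-column layout that the loading code of this notebook relies on (it reads `row[2]`, `row[3]` and `row[5]`). The sample row, its counts, and the function name `parse_rows` are made up for illustration and are not taken from the actual result files.

```python
import csv
import io

# Hypothetical content: one comment line and one data row with three counts.
SAMPLE = (
    "#size\tindex\tranking\tmodel\ttree\tcounts\n"
    "4\t0\tU\tDL\t(((a,b),c),d);\t1 5 42\n"
)


def parse_rows(handle):
    """Yield one dict per non-comment row of a tab-separated result file."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment rows
        yield {
            "s_size": int(row[0]),
            "t_index": int(row[1]),
            "model": (row[2], row[3]),  # e.g. ('U', 'DL')
            "counts": [int(c) for c in row[5].split()],
        }


rows = list(parse_rows(io.StringIO(SAMPLE)))
print(rows[0]["model"], rows[0]["counts"])
```

Python integers have arbitrary precision, so `int(c)` keeps very large counts exact at this stage.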

For each configuration, we count the number of histories under two models: one that allows only DL histories, and one that also allows DLT histories.

In [37]:
import csv
import pandas as pd
import numpy as np

In [103]:
# Parameters

# Species tree
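S_SIZE_MIN = 3
S_SIZE_MAX = 4   # this run loads sizes 3 and 4 only; the experiment itself covers sizes up to 32
S_SIZES    = list(range(S_SIZE_MIN, S_SIZE_MAX + 1))
S_SIZES_POW2 = [4]   # sizes in S_SIZES that are powers of 2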

# Number of species trees
NB_S_TREES    = 100
S_TREES_INDEX = list(range(NB_S_TREES))

# History size
H_SIZE_MIN = 1
H_SIZE_MAX = 50
H_SIZES    = list(range(H_SIZE_MIN, H_SIZE_MAX + 1))
H_SIZES_10 = [10,20,30,40,50]

# Number of rankings
NB_RANKINGS    = 25
RANKINGS_INDEX = list(range(NB_RANKINGS))

# Evolutionary models
EVOL_MODELS = [('U','DL'),('U','DLT')]

In [104]:
# Format: RESULTS[evol_model][s][n][tree_index] is 
# the number of histories of size n for tree tree_index of size s in model evol_model

RESULTS = {x:{s:{n:{t:0 for t in S_TREES_INDEX} for n in H_SIZES}  for s in S_SIZES} for x in EVOL_MODELS}
for s in S_SIZES:
    with open('../results/exp1a_'+str(s), 'r') as f:
        reader = csv.reader(f,delimiter='\t')
        for row in reader:
            if row and not row[0].startswith('#'):  # skip blank lines and comment rows
                model = (row[2],row[3])
                t_ind = int(row[1])
                row5  = row[5].split()
                for n in H_SIZES:
                    RESULTS[model][s][n][t_ind] = int(row5[n-1])
                    
RESULTS_frame = pd.DataFrame.from_dict({(m,s,n): RESULTS[m][s][n] 
                                        for m in RESULTS.keys() 
                                        for s in RESULTS[m].keys()
                                        for n in RESULTS[m][s].keys()},
                                        orient='index')

In [106]:
# Analysis 1: mean, standard deviation, and max/min ratio of the number of histories,
# per model, for a given species tree size and history size
STATS1 = {x:{s:{n:{} for n in H_SIZES_10}  for s in S_SIZES_POW2} for x in EVOL_MODELS}

for x in EVOL_MODELS:
    for s in S_SIZES_POW2:
        for n in H_SIZES_10:
            data =  np.array([RESULTS[x][s][n][t] for t in S_TREES_INDEX])
            STATS1[x][s][n] = {'avg':np.mean(data), 'std':np.std(data), 'max/min':np.max(data)/np.min(data)}
            
STATS1_frame = pd.DataFrame.from_dict({(m,s,n): STATS1[m][s][n] 
                                        for m in STATS1.keys() 
                                        for s in STATS1[m].keys()
                                        for n in STATS1[m][s].keys()},
                                        orient='index')            
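
The three statistics computed above can be checked on a toy input. The sketch below uses made-up counts and converts them to floats first: counts of this magnitude do not fit in a 64-bit integer, so NumPy would otherwise store them in an object-dtype array. The conversion is an assumption of this sketch, not something the cell above does, and it trades exactness for roughly 15 to 16 significant digits.

```python
import numpy as np

# Made-up counts, each too large for a 64-bit integer.
counts = [10**20, 2 * 10**20, 4 * 10**20]

# Explicit float conversion avoids an object-dtype array of Python ints.
data = np.array(counts, dtype=float)

stats = {
    "avg": np.mean(data),
    "std": np.std(data),  # population standard deviation (ddof=0 by default)
    "max/min": np.max(data) / np.min(data),
}
print(stats)
```

Note that `np.std` divides by the number of samples by default; pass `ddof=1` if the sample standard deviation is wanted instead.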

In [107]:
STATS1_frame

| Model | Species tree size | History size | avg | std | max/min |
|---|---|---|---|---|---|
| (U, DL) | 4 | 10 | 1.926032e+11 | 5.887393e+10 | 2.738565 |
| (U, DL) | 4 | 20 | 2.175821e+24 | 9.370291e+23 | 6.497603 |
| (U, DL) | 4 | 30 | 4.095702e+37 | 2.000962e+37 | 15.72603 |
| (U, DL) | 4 | 40 | 9.429982e+50 | 4.841396e+50 | 38.2282 |
| (U, DL) | 4 | 50 | 2.415281e+64 | 1.265100e+64 | 93.05067 |
| (U, DLT) | 4 | 10 | 6.992987e+11 | 2.568036e+12 | 289.3654 |
| (U, DLT) | 4 | 20 | 3.119619e+26 | 2.526409e+27 | 60444.44 |
| (U, DLT) | 4 | 30 | 4.042653e+41 | 3.659656e+42 | 1.116644e+07 |
| (U, DLT) | 4 | 40 | 6.583371e+56 | 6.247807e+57 | 1.998768e+09 |
| (U, DLT) | 4 | 50 | 1.204623e+72 | 1.170706e+73 | 3.546274e+11 |
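
One way to read these numbers is to compare the two models directly. The sketch below copies the `avg` column for species trees of size 4 and computes the DLT/DL ratio for each history size; the variable names are illustrative.

```python
# Average number of histories for species trees of size 4, copied from the table above.
AVG_DL = {10: 1.926032e11, 20: 2.175821e24, 30: 4.095702e37,
          40: 9.429982e50, 50: 2.415281e64}
AVG_DLT = {10: 6.992987e11, 20: 3.119619e26, 30: 4.042653e41,
           40: 6.583371e56, 50: 1.204623e72}

# Ratio of the DLT average to the DL average, per history size.
ratios = {n: AVG_DLT[n] / AVG_DL[n] for n in AVG_DL}

for n, r in ratios.items():
    print(f"n={n}: DLT/DL = {r:.3g}")
```

On these values the ratio grows with the history size.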
