# Experiment 1
### Cedric Chauve, 11/12/2018

## Introduction

In this experiment (script *exp1a.sh*) we counted the number of histories for the following data:
- species tree size (number of leaves) from 3 to 32 (exp1a) and 33 to 64 (exp1b),
- for each species tree size *k*, we considered 100 trees:
    - the first one (index 0) is the caterpillar,
    - if *k* is a power of 2, the second tree (index 1) is the complete binary tree,
    - the remaining trees are random,
- the history size (number of leaves) ranges from 1 to 50 (exp1a) and 1 to 128 (exp1b).
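To make the two special tree shapes concrete, here is a minimal sketch that builds their Newick strings. The function names and the leaf labels 1..k are assumptions made for illustration; the labels and leaf order used in the actual result files may differ.

```python
def caterpillar_newick(k):
    """Newick string of the caterpillar tree with k >= 2 leaves (hypothetical helper)."""
    tree = "(1,2)"
    # Each new leaf is attached above the current root.
    for leaf in range(3, k + 1):
        tree = f"({tree},{leaf})"
    return tree + ";"


def complete_binary_newick(k):
    """Newick string of the complete binary tree with k leaves, k a power of 2 (hypothetical helper)."""
    if k < 1 or (k & (k - 1)) != 0:
        raise ValueError("k must be a power of 2")
    level = [str(i) for i in range(1, k + 1)]
    # Pair up adjacent subtrees until a single root remains.
    while len(level) > 1:
        level = [f"({level[i]},{level[i + 1]})" for i in range(0, len(level), 2)]
    return level[0] + ";"


print(caterpillar_newick(4))      # (((1,2),3),4);
print(complete_binary_newick(4))  # ((1,2),(3,4));
```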

We record the results for species trees of a given size *k* in the file *results/exp1a_k*. Each non-comment row of the result file has the following tab-separated format:
- species tree size,
- species tree index,
- ranking type (U for unranked; we do not consider ranked trees),
- evolutionary model (DL or DLT),
- Newick string describing the tree,
- number of histories for each history size, separated by spaces.
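As an illustration of this row format, the sketch below parses a single row. It follows the column positions used by `read_results` further down (ranking type and evolutionary model in the third and fourth columns, counts in the sixth); placing the Newick string in the fifth column is an assumption. The helper name `parse_result_row` and the example row, including its counts, are made up.

```python
def parse_result_row(line):
    """Split one tab-separated result row into named fields (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "s": int(fields[0]),                  # species tree size
        "tree_index": int(fields[1]),         # index of the tree among the 100 trees
        "model": (fields[2], fields[3]),      # e.g. ('U', 'DL') or ('U', 'DLT')
        "newick": fields[4],                  # assumed position of the Newick string
        # Python integers have arbitrary precision, so very large counts are read exactly.
        "counts": [int(x) for x in fields[5].split()],
    }


# Made-up row: the counts are placeholders, not real results.
example = "4\t0\tU\tDL\t(((1,2),3),4);\t1 5 30"
row = parse_result_row(example)
print(row["model"], row["counts"])
```

The count for histories of size *n* is then `row["counts"][n - 1]`, which matches the indexing in `read_results`.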

For each configuration, we count the number of histories in two models, one with only DL histories and one with DLT histories.

In [5]:
import csv
import pandas as pd
import numpy as np
import gzip
import io

In [17]:
# Parameters

# Number of species trees
NB_S_TREES    = 100
S_TREES_INDEX = list(range(NB_S_TREES))
# Evolutionary models
EVOL_MODELS = [('U','DL'),('U','DLT')]

In [65]:
# Format: RESULTS[evol_model][s][n][tree_index] is 
# the number of histories of size n for tree tree_index of size s in model evol_model

def read_results(S_SIZES,H_SIZES,S_TREES_INDEX,PREFIX):
    RESULTS = {x:{s:{n:{t:0 for t in S_TREES_INDEX} for n in H_SIZES}  for s in S_SIZES} for x in EVOL_MODELS}
    for s in S_SIZES:
        with gzip.open('../results/'+PREFIX+'_'+str(s)+'.gz', 'r') as f:
            reader = csv.reader(io.TextIOWrapper(f, newline=""),delimiter='\t')
            for row in reader:
                # Skip comment rows, blank lines and incomplete rows
                if len(row) > 5 and not row[0].startswith('#'):
                    model = (row[2],row[3])
                    t_ind = int(row[1])
                    row5  = row[5].split()
                    for n in H_SIZES:
                        RESULTS[model][s][n][t_ind] = int(row5[n-1])
                    
    RESULTS_frame = pd.DataFrame.from_dict({(m,s,n): RESULTS[m][s][n] 
                                            for m in RESULTS.keys() 
                                            for s in RESULTS[m].keys()
                                            for n in RESULTS[m][s].keys()},
                                            orient='index')
    return((RESULTS,RESULTS_frame))

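def compute_stats(S_SIZES,H_SIZES,S_TREES_INDEX):
    # Note: the counts are read from the module-level variable RESULTS,
    # which must be defined before this function is called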
    STATS = {x:{s:{n:{} for n in H_SIZES}  for s in S_SIZES} for x in EVOL_MODELS}

    for x in EVOL_MODELS:
        for s in S_SIZES:
            for n in H_SIZES:
                data =  np.array([RESULTS[x][s][n][t] for t in S_TREES_INDEX])
                STATS[x][s][n] = {'avg':np.mean(data), 'std':np.std(data), 'max/min':np.max(data)/np.min(data)}
            
    STATS_frame = pd.DataFrame.from_dict({(m,s,n): STATS[m][s][n] 
                                         for m in STATS.keys() 
                                         for s in STATS[m].keys()
                                         for n in STATS[m][s].keys()},
                                         orient='index')
    return((STATS,STATS_frame))

def compute_ratio_DL_DLT(RESULTS,S_SIZES,H_SIZES,S_TREES_INDEX):
    RATIOS = {s:{n:{} for n in H_SIZES}  for s in S_SIZES}
    for s in S_SIZES:
        for n in H_SIZES:
            ratios = np.array([RESULTS[('U','DLT')][s][n][t]/RESULTS[('U','DL')][s][n][t] for t in S_TREES_INDEX])
            RATIOS[s][n] = {'avg':np.mean(ratios), 'std':np.std(ratios), 'min':np.min(ratios), 'max':np.max(ratios), 'max/min':np.max(ratios)/np.min(ratios)}
            
    RATIOS_frame = pd.DataFrame.from_dict({(s,n): RATIOS[s][n]
                                          for s in RATIOS.keys()
                                          for n in RATIOS[s].keys()},
                                          orient='index')
    return((RATIOS,RATIOS_frame))

## Experiment exp1a

### Analysis 1.
The first analysis looks at the number of histories for each pair *(s,n)* (*s* = species tree size, *n* = history size). For each selected pair, we look at the average number of histories, the standard deviation and the ratio *max/min*, all taken over the 100 species trees of size *s*.
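As a toy illustration of these three statistics, the sketch below computes them for four made-up counts (not taken from the experiment). Note that `np.std` returns the population standard deviation (`ddof=0`) by default.

```python
import numpy as np

# Made-up counts for four hypothetical species trees.
counts = np.array([2.0, 4.0, 6.0, 8.0])

stats = {
    "avg": np.mean(counts),
    "std": np.std(counts),                        # population standard deviation (ddof=0)
    "max/min": np.max(counts) / np.min(counts),   # spread between the extreme trees
}
print(stats)
```

Here the average is 5, the standard deviation is the square root of 5, and the *max/min* ratio is 4.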

In [36]:
# Analysis 1: average, standard deviation and max/min ratio of the number of histories per model for a given species tree size
S_SIZES_1a = [4,8,16,32]
H_SIZES_1a = [10,20,30,40,50]

(RESULTS_1a,RESULTS_1a_frame) = read_results(S_SIZES_1a,H_SIZES_1a,S_TREES_INDEX,'exp1a')
# compute_stats reads the module-level variable RESULTS, so bind it first
RESULTS = RESULTS_1a
(STATS_1a_1,STATS_1a_1_frame) = compute_stats(S_SIZES_1a,H_SIZES_1a, S_TREES_INDEX)

In [58]:
np.std([RESULTS_1a[('U','DLT')][4][10][t]/RESULTS_1a[('U','DL')][4][10][t] for t in S_TREES_INDEX])

28.96211710207647

In [37]:
STATS_1a_1_frame

| model | s | n | avg | std | max/min |
|---|---|---|---|---|---|
| (U, DL) | 4 | 10 | 206815500000.0 | 46184510000.0 | 2.738565 |
| (U, DL) | 4 | 20 | 2.402022e+24 | 7.350661e+23 | 6.497603 |
| (U, DL) | 4 | 30 | 4.578739e+37 | 1.569684e+37 | 15.72603 |
| (U, DL) | 4 | 40 | 1.059871e+51 | 3.797903e+50 | 38.2282 |
| (U, DL) | 4 | 50 | 2.720679e+64 | 9.924259e+63 | 93.05067 |
| (U, DL) | 8 | 10 | 1910970000000000.0 | 1566116000000000.0 | 35.66652 |
| (U, DL) | 8 | 20 | 2.474365e+32 | 2.986438e+32 | 844.4849 |
| (U, DL) | 8 | 30 | 6.030736e+49 | 8.544215e+49 | 20285.94 |
| (U, DL) | 8 | 40 | 1.91094e+67 | 2.946170e+67 | 495821.5 |
| (U, DL) | 8 | 50 | 6.934493e+84 | 1.125588e+85 | 12203430.0 |


### Comments.
For both the DL and DLT models, the standard deviation is of the same order as the mean, and in the rows shown above it exceeds the mean for *s* = 8 and *n* ≥ 20. This indicates a very large spread of the distribution of the number of histories, which is also illustrated by the very large ratio *max/min*.

### Analysis 2. 
We look at the ratio between the number of DLT-histories and the number of DL-histories.
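The per-tree ratio can be sketched as follows, with made-up counts. Because the counts are Python integers, true division yields a finite float even when each count alone is too large to be converted to a float, as long as the quotient itself is representable. `fractions.Fraction` keeps the ratio exact if needed.

```python
from fractions import Fraction

# Made-up counts for one hypothetical tree, deliberately beyond the float range.
dl_count = 10**400
dlt_count = 3 * 10**402

ratio_float = dlt_count / dl_count            # int / int true division gives a float
ratio_exact = Fraction(dlt_count, dl_count)   # exact rational value

print(ratio_float)   # 300.0
print(ratio_exact)   # 300
```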

In [66]:
(RATIOS_DLT_DL_1a,RATIOS_DLT_DL_1a_frame) = compute_ratio_DL_DLT(RESULTS_1a,S_SIZES_1a,H_SIZES_1a,S_TREES_INDEX)

In [67]:
RATIOS_DLT_DL_1a_frame

| s | n | avg | std | min | max | max/min |
|---|---|---|---|---|---|---|
| 4 | 10 | 5.65116 | 28.96212 | 1.0 | 289.3654 | 289.3654 |
| 4 | 20 | 631.1465 | 6014.839 | 1.0 | 60444.44 | 60444.44 |
| 4 | 30 | 112414.1 | 1110995.0 | 1.0 | 11166440.0 | 11166440.0 |
| 4 | 40 | 20013450.0 | 198872500.0 | 1.0 | 1998768000.0 | 1998768000.0 |
| 4 | 50 | 3547184000.0 | 35284890000.0 | 1.0 | 354627400000.0 | 354627400000.0 |
| 8 | 10 | 314.4372 | 3062.889 | 1.0 | 30789.02 | 30789.02 |
| 8 | 20 | 12220640.0 | 121586200.0 | 1.0 | 1221988000.0 | 1221988000.0 |
| 8 | 30 | 431934100000.0 | 4297687000000.0 | 1.0 | 43193380000000.0 | 43193380000000.0 |
| 8 | 40 | 1.471052e+16 | 1.463678e+17 | 1.0 | 1.471052e+18 | 1.471052e+18 |
| 8 | 50 | 4.920755e+20 | 4.89609e+21 | 1.0 | 4.920755e+22 | 4.920755e+22 |


### Comments.
Again, the spread is very large, and the ratio increases quickly with both the species tree size and the history size. This agrees with the intuition that the search space grows very quickly when transfers are added to the model.

## Experiment exp1b

The numbers of histories are too large for floating-point arithmetic in Python, and the analysis fails with:
"OverflowError: cannot convert float infinity to integer"
