# <font color = red> Simulation Analysis </font>
This notebook records the parameters of the Wright-Fisher simulations used to generate our test data sets, as well as the commands for running the inference algorithms on the test data and compiling the results.

## Contents
- ### [Libraries and variables](#lib)
- ### Data analysis
    - [Generation of test data through Wright-Fisher simulations](#sim)
    - [Repeating simulation](#rep)
    - [Running the inference algorithms and compiling output](#nsdt)
    - [Dealing with the output](#deal)

### <a id='lib'></a> Libraries and variables

In [1]:
# Full library list and version numbers

print('This notebook was prepared using:')

import sys
print('python version %s' % sys.version)

import numpy as np
print('numpy version %s' % np.__version__)

import random

import pandas as pd

import sklearn as sk
from sklearn.metrics import roc_auc_score

try:
    from itertools import izip as zip   # Python 2 only: lazy zip
except ImportError:
    import itertools                    # Python 3: the built-in zip is already lazy

import simulation as sim

# GitHub directories
HIV_DIR = 'data/HIV'
MPL_DIR = 'src/MPL'
SIM_DIR = 'data/simulation'
FIG_DIR = 'figures'
HIV_MPL_DIR = 'src/MPL/HIV'

This notebook was prepared using:
python version 3.9.13 (main, Aug 25 2022, 18:24:45) 
[Clang 12.0.0 ]
numpy version 1.21.5


<a id='sim'></a>
### Generation of test data through Wright-Fisher simulations
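The fitness of genotype $a$ is given by
$f_a = 1 + \sum_{i=1}^{L} s_i g_i^a + \sum_{n=1}^{N_p} s_n g_n^a$

Here $s_i$ is the selection coefficient at site $i$ of a sequence of length $L$, $s_n$ is the trait coefficient of trait group $n$ (out of $N_p$ groups), and $g_i^a$ and $g_n^a$ indicate whether genotype $a$ carries the mutant allele at site $i$ or expresses trait $n$.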

The simulation starts from an all-wild-type population. Each site has only 2 alleles (wild type and mutant).
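
As a minimal sketch of the fitness model above, the function below evaluates $f_a$ for a binary genotype. It is not taken from simulation.py: the function name `fitness`, the 0-based site indexing, and the rule that a trait is expressed when at least one of its sites is mutated are assumptions made for illustration. The site ordering (beneficial, then neutral, then deleterious) follows the analysis cells later in this notebook.

```python
def fitness(genotype, s_site, trait_sites, s_trait):
    """Evaluate f_a = 1 + sum_i s_i g_i + sum_n s_n g_n for a binary genotype.

    Assumption (hypothetical): trait n is expressed (g_n = 1) when at least
    one site in its group carries the mutant allele.
    """
    f = 1.0 + sum(s * g for s, g in zip(s_site, genotype))
    for sites, s_n in zip(trait_sites, s_trait):
        if any(genotype[i] for i in sites):
            f += s_n
    return f

# Parameter values copied from pdata: 10 beneficial, 30 neutral, 10 deleterious sites
s_site = [0.02] * 10 + [0.0] * 30 + [-0.02] * 10
trait_sites = [[1, 2, 3], [11, 12, 13]]   # assumed to be 0-based indices
s_trait = [0.1, 0.1]

wild_type = [0] * 50
mutant = list(wild_type)
mutant[0] = 1   # beneficial site outside both trait groups
mutant[2] = 1   # beneficial site inside the first trait group

f_wild = fitness(wild_type, s_site, trait_sites, s_trait)
f_mutant = fitness(mutant, s_site, trait_sites, s_trait)
print(f_wild, f_mutant)
```

Under these assumptions the wild type has fitness 1, and the double mutant gains 0.02 from each beneficial site plus 0.1 from the expressed trait.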

Wright-Fisher simulations are performed using simulation.py. The output of these simulations is saved for processing. 

Here we run MPL with the Python implementation, which gives results similar to those of the C++ code.

In [8]:
from importlib import reload  # the imp module is deprecated
reload(sim)

pdata = {
    'n_gen':    1000,                                   # number of generations
    'N':        1000,                                   # population size
    'mu':       2e-4,                                   # mutation rate
    't0':       0,                                      # start generation
    'T':        1000,                                   # final generation
    'ns':       [10, 20, 30, 40, 50,  80, 100, 1000],   # number of sequences to sample at each time point
    'dt':       [ 1,  5, 10, 20, 50],                   # spacing between generations
    'xfile':    'example-0_ns1000_dt1',                 # input file name
    'yfile':    'example',                              # output file name
    'n_ben':    10,                                     # number of beneficial mutations
    'n_neu':    30,                                     # number of neutral mutations
    'n_del':    10,                                     # number of deleterious mutations
    'n_pol':    2,                                      # number of trait groups([1,2,3],[11,12,13])
    's_ben':    0.02,                                   # selection coefficient of beneficial mutations
    's_neu':    0,                                      # selection coefficient of neutral mutations
    's_del':    -0.02,                                  # selection coefficient of deleterious mutations
    's_pol':    0.1,                                    # trait coefficient
    'gamma':    1,                                      # regularization value
    'alphabet': ['A', 'T'],                             # all possible alleles                              
    'po_site':  [[1,2,3],[11,12,13]],                   # binary trait sites
    'save_cov': True,                                   # if true, print covariance matrix
    'output' :  True,                                   # if true, print output
    't_format': True,                                   # if true, output txt format
    'd_format': True,                                   # if true, output dat format
}

sim.simulate(**pdata)
sim.run_mpl_binary(**pdata)

Simulation completes
beginning with wild type, sequence has 2 types of alleles, totally run 1000 generations
the length of the sequence is 50, with 10 beneficial sites (0.02) and 10 deleterious sites（-0.02)
containing 2 trait groups with trait coefficients equal to 0.1, they are 
[[1, 2, 3], [11, 12, 13]]
Calculation completes
inferred beneficial selection coefficients are:
[[0.01558462]
 [0.01511324]
 [0.01905374]
 [0.0237253 ]
 [0.01280081]
 [0.01603826]
 [0.01896436]
 [0.01336382]
 [0.02320998]
 [0.0151462 ]]
inferred neutral selection coefficients are:
[[ 0.0038976 ]
 [-0.01560297]
 [-0.00960473]
 [ 0.00108885]
 [ 0.00476186]
 [ 0.00799394]
 [-0.00991255]
 [ 0.00191754]
 [ 0.00059171]
 [-0.0035239 ]
 [ 0.01177641]
 [ 0.01258354]
 [-0.00039107]
 [ 0.00505423]
 [ 0.00718061]
 [-0.00881666]
 [ 0.00576916]
 [ 0.00362709]
 [-0.00177835]
 [-0.00485225]
 [-0.00615877]
 [ 0.00034462]
 [-0.01221871]
 [-0.00076708]
 [-0.01102052]
 [ 0.00163876]
 [-0.00999011]
 [-0.01036008]
 [-0.00190997]
 [

<a id='rep'></a>
### Repeating simulation
Here we run multiple simulations and write a shell script that uses the C++ code to infer the coefficients.

In [3]:
from importlib import reload  # the imp module is deprecated
reload(sim)

SIM_DIR = 'data/simulation'
n       = 100

pdata['output'] = False

with open("./%s/jobs/run.sh"%SIM_DIR,'w') as f:
    # the first line of the script compiles the C++ inference code
    f.write('g++ main.cpp inf_binary.cpp io.cpp  -march=native -lgsl -lgslcblas -o mpl\n')

    for i in range(n):
        # simulate replicate i, then append the matching inference command
        pdata['xfile'] = 'jobs/seq/example-'+str(i)+'_ns1000_dt1'
        sim.simulate(**pdata)
        f.write('./mpl -i ./data/seq/example-%d_ns1000_dt1.dat -o ./output/sc-%d.dat -g 1e3 '%(i,i))
        f.write('-N 1e3 -mu 2e-4 -q 2 -p ./data/polysite.dat -ps ./data/polysequence.dat\n')
print('ran %d simulations and wrote 1 shell script, run.sh'%n)

ran 100 simulations and wrote 1 shell script, run.sh


<a id='nsdt'></a>
### Running the inference algorithms and compiling output
For each simulation, run the inference for every combination of the number of sampled sequences $n_s$ and the time spacing $\Delta t$.

In [None]:
from importlib import reload  # the imp module is deprecated
reload(sim)

ns_vals = pdata['ns']
dt_vals = pdata['dt']

n = 100
with open("./%s/jobs/expand.sh"%SIM_DIR,'w') as g:
    g.write('g++ main.cpp inf_binary.cpp io.cpp  -march=native -lgsl -lgslcblas -o mpl\n')
    for k in range(n):
        pdata['xfile'] = 'example-'+str(k)
        sim.py2c(**pdata)
        # one inference command per (n_s, dt) combination
        for ns in ns_vals:
            for dt in dt_vals:
                g.write('./mpl -i data/input/example-%d_ns%d_dt%d.dat'%(k,ns,dt))
                g.write(' -o output/sc-%d_ns%d_dt%d.dat'%(k,ns,dt))
                g.write(' -g 1e3 -N 1e3 -mu 2e-4 -q 2 -p data/polysite.dat -ps data/polysequence.dat\n')
        # the script removes the input files of simulation k after its commands
        g.write('rm -r data/input/example-%d_*\n'%k)
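
To get a sense of how large expand.sh is, the sketch below enumerates the same $(n_s, \Delta t)$ grid with `itertools.product` and counts the inference commands. The parameter values are copied from `pdata`; the variable names `grid`, `n_sims`, and `n_jobs` are illustrative and do not appear elsewhere in the notebook.

```python
from itertools import product

ns_vals = [10, 20, 30, 40, 50, 80, 100, 1000]   # sequences sampled per time point
dt_vals = [1, 5, 10, 20, 50]                    # spacing between sampled generations
n_sims = 100                                    # number of repeated simulations

# every (n_s, dt) combination evaluated for a single simulation
grid = list(product(ns_vals, dt_vals))

# total number of ./mpl commands written to the script
n_jobs = n_sims * len(grid)
print(len(grid), n_jobs)
```

With 8 values of $n_s$ and 5 values of $\Delta t$, each simulation contributes 40 commands, for 4000 in total.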

<a id='deal'></a>
### Dealing with the output
Collect the inferred coefficients from all simulations and write them to mpl_collected.csv.

In [5]:
nB = pdata['n_ben']
nD = pdata['n_del']
nN = pdata['n_neu']
nP = pdata['n_pol']
seq_length = nB+nD+nN

f = open('%s/mpl_collected.csv'%SIM_DIR,'w')
f.write('trajectory,t0,T,ns,delta_t')
for i in range(seq_length):
    f.write(',sc_%d'%i)
for i in range(nP):
    f.write(',pc_%d'%i)
f.write('\n')

for k in range(n):
    for i in range(len(ns_vals)):
        for j in range(len(dt_vals)):
            sc = np.loadtxt('%s/jobs/output/sc-%d_ns%d_dt%d.dat'%(SIM_DIR,k,ns_vals[i],dt_vals[j]))
            f.write('%d,0,1000,%d,%d'%(k,ns_vals[i],dt_vals[j]))
            for ii in range(seq_length):
                f.write(',%f'%sc[ii])
            for ii in range(nP):
                f.write(',%f'%sc[-nP+ii])
            f.write('\n')
f.close()
print('collect all coefficients for %d simulations'%n)

collect all coefficients for 100 simulations
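
The column layout of mpl_collected.csv can be checked without any output files. The sketch below rebuilds the header from the parameter values used in this notebook; the variable names `meta_cols` and `header` are illustrative only.

```python
n_ben, n_neu, n_del, n_pol = 10, 30, 10, 2
seq_length = n_ben + n_neu + n_del

# metadata columns, then one selection coefficient per site,
# then one trait coefficient per trait group
meta_cols = ['trajectory', 't0', 'T', 'ns', 'delta_t']
header = (meta_cols
          + ['sc_%d' % i for i in range(seq_length)]
          + ['pc_%d' % i for i in range(n_pol)])

print(len(header))
print(header[:6], header[-2:])
```

Each row therefore holds 5 metadata fields, 50 selection coefficients, and 2 trait coefficients.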


Calculate the AUROC for classifying beneficial and deleterious mutations and the normalized root-mean-square error (NRMSE) of the trait coefficients, then write the results to mpl_collected_extended.csv.
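
The notebook computes AUROC with `roc_auc_score` from scikit-learn. To illustrate what that number means, the sketch below uses the equivalent pairwise definition: the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative site, with ties counted as one half. The helper `auroc` and the five toy coefficients are made up for illustration and are not part of the analysis.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# toy example: 2 beneficial sites, 1 neutral site, 2 deleterious sites
inferred = [0.021, 0.004, 0.006, -0.003, -0.018]   # hypothetical inferred coefficients
true_ben = [1, 1, 0, 0, 0]
true_del = [0, 0, 0, 1, 1]

auroc_ben = auroc(true_ben, inferred)
# deleterious sites should have the most negative coefficients, so negate the scores
auroc_del = auroc(true_del, [-s for s in inferred])
print(auroc_ben, auroc_del)
```

In this toy case one beneficial site is ranked below the neutral site, so 5 of the 6 beneficial-versus-other pairs are ordered correctly, while both deleterious sites are ranked below all other sites.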

In [6]:
nB = pdata['n_ben']
nD = pdata['n_del']
nN = pdata['n_neu']
nP = pdata['n_pol']
fB = pdata['s_ben']
fD = pdata['s_del']
fP = pdata['s_pol']

true_ben = [1 if i in                        range(nB) else 0 for i in range(seq_length)];
true_del = [1 if i in range(seq_length-nD, seq_length) else 0 for i in range(seq_length)];
true_neu = [1 if i in range(        nB, seq_length-nD) else 0 for i in range(seq_length)];

coefs = ['sc_%d' % j for j in range(seq_length)]

df              = pd.read_csv('%s/mpl_collected.csv'%SIM_DIR, memory_map=True);

# difference between inferred coefficients and true coefficients
for i in range(seq_length):
    if   true_ben[i]: df['d_sc%d' % i] = df['sc_%d' % i] - fB;
    elif true_del[i]: df['d_sc%d' % i] = df['sc_%d' % i] - fD;
    elif true_neu[i]: df['d_sc%d' % i] = df['sc_%d' % i];
for i in range(nP):
    df['d_pc%d' % i] = df['pc_%d' % i] - fP;

# AUROC for beneficial and deleterious mutation
df['AUROC_ben'] = pd.Series(data=[roc_auc_score(true_ben, np.array(df.iloc[i][coefs])) for i in range(len(df))]);
df['AUROC_del'] = pd.Series(data=[roc_auc_score(true_del,-np.array(df.iloc[i][coefs])) for i in range(len(df))]);

# error for trait part
norm  = nP*(fP**2)
error = 0
for i in range(nP):
    error += (df['pc_%d' % i] - fP) ** 2
df['error_pol'] = np.sqrt(error/norm)


df.to_csv('%s/mpl_collected_extended.csv'%SIM_DIR)

print('collect all data for %d simulations'%n)

collect all data for 100 simulations
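
The `error_pol` column is the NRMSE of the trait coefficients, $\sqrt{\sum_n (\hat{s}_n - s_n)^2 / \sum_n s_n^2}$, where every true trait coefficient equals `s_pol`, so the denominator reduces to $N_p s_{pol}^2$. A small worked example follows; the two inferred values are invented for illustration.

```python
import math

s_pol = 0.1                  # true trait coefficient, as in pdata
inferred_pc = [0.12, 0.09]   # hypothetical inferred trait coefficients
n_pol = len(inferred_pc)

# squared error summed over trait groups, normalized by the summed squared true values
sq_err = sum((pc - s_pol) ** 2 for pc in inferred_pc)
norm = n_pol * s_pol ** 2
nrmse = math.sqrt(sq_err / norm)
print(round(nrmse, 4))
```

Here the squared errors sum to 0.0005 and the normalization is 0.02, so the NRMSE is $\sqrt{0.025} \approx 0.158$.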
