# <font color = red> Simulation Analysis </font>
This notebook records the parameters of the Wright-Fisher simulations used to generate our test data sets, as well as the commands for running the inference algorithms on the test data and compiling the results.

## Contents
- ### [Libraries and variables](#lib)
- ### Data analysis
    - [Generation of test data through Wright-Fisher simulations](#sim)
    - [Repeating simulation](#rep)
    - [Running the inference algorithms and compiling output](#nsdt)
    - [Dealing with the output](#deal)

### <a id='lib'></a> Libraries and variables

In [1]:
# Full library list and version numbers

print('This notebook was prepared using:')

import sys
print('python version %s' % sys.version)

import numpy as np
print('numpy version %s' % np.__version__)

import random

import pandas as pd

import sklearn as sk
from sklearn.metrics import roc_auc_score

try:
    from itertools import izip as zip   # Python 2: use the lazy izip as zip
except ImportError:
    import itertools                    # Python 3: the built-in zip is already lazy

import importlib
import simulation as sim

# GitHub directories
HIV_DIR = 'data/HIV'
MPL_DIR = 'src/MPL'
SIM_DIR = 'data/simulation'
FIG_DIR = 'figures'

This notebook was prepared using:
python version 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:26:08) [Clang 14.0.6 ]
numpy version 1.24.2


<a id='sim'></a>
### Single Wright-Fisher simulation
The fitness of a genotype $a$ is given by:
$f_a = 1 + \sum_i^L s_i g_i^a + \sum_n^{N_t} s_n g_n^a$

The simulation starts from an entirely wild-type population. Each site has only two alleles (wild type and mutant).
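
As a rough illustration of the fitness model above, the sketch below evaluates $f_a$ for a single binary genotype. The function name `fitness`, its argument layout, and the rule that a trait is "on" when any of its sites carries the mutant allele are assumptions made for this example; they are not taken from `simulation.py`. Whether the site indices in `escape_group` are 0-based is also assumed here.

```python
import numpy as np

def fitness(genotype, s, trait_sites, s_trait):
    """Fitness of one binary genotype under the additive model above.

    genotype    : 0/1 array of length L (1 = mutant allele)
    s           : per-site selection coefficients s_i, length L
    trait_sites : list of site-index lists, one per binary trait (hypothetical layout)
    s_trait     : trait coefficients s_n, one per trait
    """
    genotype = np.asarray(genotype)
    f = 1.0 + float(np.dot(s, genotype))      # 1 + sum_i s_i g_i
    for sites, s_n in zip(trait_sites, s_trait):
        # ASSUMPTION: g_n = 1 if any site in the trait group is mutated
        g_n = 1 if genotype[sites].any() else 0
        f += s_n * g_n                        # + sum_n s_n g_n
    return f

# Toy setup mirroring the pdata values: 10 beneficial, 30 neutral, 10 deleterious sites
L = 50
s = np.zeros(L)
s[:10] = 0.02
s[-10:] = -0.02
trait_sites = [[1, 2, 3], [11, 12, 13]]
s_trait = [0.1, 0.1]

wild_type = np.zeros(L, dtype=int)
mutant = wild_type.copy()
mutant[2] = 1   # a beneficial site that also belongs to the first trait group

f_wt = fitness(wild_type, s, trait_sites, s_trait)
f_mut = fitness(mutant, s, trait_sites, s_trait)
print(f_wt, f_mut)
```

Under these assumptions the wild type has fitness 1, and the single mutant gains both its selection coefficient and the trait coefficient.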

Wright-Fisher simulations are performed using simulation.py. The output of these simulations is saved for processing. 

In this part, we run MPL using the Python code, which gives results similar to those of the C++ code.

In [90]:
importlib.reload(sim)

pdata = {
    'n_gen':    1000,                                   # number of generations
    'N':        1000,                                   # population size
    'mu':       2e-4,                                   # mutation rate
    'r':        2e-4,                                   # recombination rate
    't0':       0,                                      # start generation
    'T':        1000,                                   # final generation
    'ns':       [10, 20, 30, 40, 50,  80, 100, 1000],   # number of sequences to sample at each time point
    'dt':       [ 1,  5, 10, 20, 50],                   # spacing between generations
    'xfile':    '0_ns1000_dt1',                         # input file name
    'n_ben':    10,                                     # number of beneficial mutations
    'n_neu':    30,                                     # number of neutral mutations
    'n_del':    10,                                     # number of deleterious mutations
    's_ben':    0.02,                                   # selection coefficient of beneficial mutations
    's_neu':    0,                                      # selection coefficient of neutral mutations
    's_del':    -0.02,                                  # selection coefficient of deleterious mutations
    's_pol':    0.1,                                    # trait coefficient
    'gamma':    1,                                      # regularization value
    'alphabet': ['A', 'T'],                             # all possible alleles                              
    'escape_group':  [[1,2,3],[11,12,13]],              # binary trait sites
    'save_cov': True,                                   # if true, print covariance matrix
    'output' :  True,                                   # if true, print output
}

# sim.simulate(**pdata)
# sim.run_mpl_binary(**pdata)

<a id='rep'></a>
### Repeating simulation
In this part, we run multiple simulations and use the C++ code to infer the coefficients.

In [12]:
import json
from importlib import reload
reload(sim)

SIM_DIR = 'data/simulation'
n_sim   = 100

pdata['output'] = False
pdata['save_cov'] = False

# # get random escape groups for 100 simulations
# escape_groups = []
# for n in range(n_sim):
#     random_numbers   = random.sample(range(50), 6)
#     escape_group_raw = [random_numbers[:3],random_numbers[3:]]
#     escape_group     = [sorted(sublist) for sublist in escape_group_raw]
#     escape_groups.append(escape_group)
    
#     # trait sites 
#     f = open('%s/jobs/traitsite/traitsite-%s.dat'%(SIM_DIR,n), 'w')
#     for i in range(len(escape_group)):
#         f.write('%s\n'%'\t'.join([str(ii) for ii in escape_group[i]]))
#     f.close()
    
#     # distance between 2 trait sites
#     f = open('%s/jobs/traitdis/traitdis-%s.dat'%(SIM_DIR,n), 'w')
#     for i in range(len(escape_group)):
#         i_dis = []
#         for j in range(len(escape_group[i])-1):
#             i_dis.append(int(escape_group[i][j+1]-escape_group[i][j]))
#         f.write('%s\n'%'\t'.join([str(ii) for ii in i_dis]))
#     f.close()

# # save and load escape group information
# with open("%s/jobs/escape_groups.dat"%SIM_DIR, 'w') as file:
#     json.dump(escape_groups, file)

# with open("%s/jobs/escape_groups.dat"%SIM_DIR, 'r') as file:
#     escape_groups = json.load(file)

# # inference (the loop variable is n, so index every per-simulation quantity with n)
# for n in range(n_sim):
#     pdata['escape_group'] = escape_groups[n]
#     pdata['xfile'] = 'jobs/sequences/example-'+str(n)+'_ns1000_dt1'
#     pdata['yfile'] = 'jobs/output/sc-'+str(n)
#     sim.simulate(**pdata)
#     sim.run_mpl_binary(**pdata)
#     sim.run_mpl_binary_noR(**pdata)

# print('completed %d simulations'%n_sim)

In [68]:
'binary C code'
f = open("src/run_sim.sh",'w')
f.write('g++ binary/main.cpp binary/inf_binary.cpp binary/io_binary.cpp -march=native -lgsl -lgslcblas -o mpl\n')
n_sim = 100
for n in range(n_sim):
    f.write('./mpl -d ../data/simulation -i jobs/sequences/example-%d_ns1000_dt1.dat '%n)
    f.write('-o jobs/output/sc-%d.dat -g 1 -N 1e3 -mu 2e-4 -rr 2e-4 '%n)
    f.write('-e jobs/traitsite/traitsite-%d.dat -es traitseq.dat -ed jobs/traitdis/traitdis-%d.dat '%(n,n))
    f.write('-sc jobs/covariance/covariance-%d.dat\n'%(n))
    
f.close()

print('we have produced the shell script run_sim.sh')

we have produced the shell script run_sim.sh


In [100]:
'multiple C code'
f = open("src/run_sim_mul.sh",'w')
f.write('g++ main.cpp inf.cpp io.cpp -march=native -lgsl -lgslcblas -o mpl\n')
n_sim = 100
for n in range(n_sim):
    f.write('./mpl -d ../data/simulation -i jobs/sequences/example-%d_ns1000_dt1.dat '%n)
    f.write('-o jobs/output_multiple/sc-%d.dat -g 1 -mu 2e-4 -rr 2e-4 -q 2 '%n)
    f.write('-e jobs/traitsite/traitsite-%d.dat -es traitseq.dat -ed jobs/traitdis/traitdis-%d.dat\n'%(n,n))
    
f.close()

print('we have produced the shell script run_sim_mul.sh')

we have produced the shell script run_sim_mul.sh


In [45]:
nB = pdata['n_ben']
nD = pdata['n_del']
nN = pdata['n_neu']
ne = 2
seq_length = nB+nD+nN

f = open('%s/mpl_collected_C.csv'%SIM_DIR,'w')
f.write('trajectory,ns,delta_t')
for i in range(seq_length):
    f.write(',sc_%d'%i)
for i in range(ne):
    f.write(',pc_%d'%i)
f.write('\n')

for k in range(n_sim):   # loop over all simulations (n is a leftover loop variable)
    sc = np.loadtxt('%s/jobs/output/sc-%d.dat'%(SIM_DIR,k))
    f.write('%d,1000,1'%(k))
    for ii in range(seq_length):
        f.write(',%f'%(sc[2*ii+1]-sc[2*ii]))
    for ii in range(ne):
        f.write(',%f'%sc[-ne+ii])
    f.write('\n')
f.close()

<a id='nsdt'></a>
### Running the inference algorithms and compiling output
For each simulation, run the inference with different numbers of sampled sequences n_s and different time spacings Δt between samples.
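
To make the roles of n_s and Δt concrete, here is a minimal sketch of how a full trajectory could be subsampled. It is not the implementation in `simulation.py` (the `sim.py2c` call below is what the notebook actually relies on). The function `subsample`, the genotype-count array layout, and the use of multinomial sampling (that is, sampling with replacement) are all assumptions for illustration.

```python
import numpy as np

def subsample(counts, ns, dt, rng):
    """Keep every dt-th generation and draw ns sequences at each kept time point.

    counts : (n_generations, n_genotypes) array of genotype counts (hypothetical layout)
    ns     : number of sequences to sample at each time point
    dt     : spacing between sampled generations
    rng    : numpy random Generator
    """
    counts = np.asarray(counts)
    times = np.arange(0, counts.shape[0], dt)
    sampled = np.empty((len(times), counts.shape[1]), dtype=int)
    for row, t in enumerate(times):
        freqs = counts[t] / counts[t].sum()
        # ASSUMPTION: sequences are drawn with replacement (multinomial sampling)
        sampled[row] = rng.multinomial(ns, freqs)
    return times, sampled

rng = np.random.default_rng(0)
# Toy trajectory: 11 generations, 3 genotypes, population size 1000
toy = rng.multinomial(1000, [0.7, 0.2, 0.1], size=11)
times, sampled = subsample(toy, ns=20, dt=5, rng=rng)
print(times)
print(sampled.sum(axis=1))
```

With `dt=5`, only generations 0, 5 and 10 are kept, and each kept time point contains exactly `ns` sampled sequences.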

In [72]:
from importlib import reload
reload(sim)

ns_vals = pdata['ns']
dt_vals = pdata['dt']

n_sim = 100

g = open("src/sim_expand.sh",'w')
g.write('g++ binary/main.cpp binary/inf_binary.cpp binary/io_binary.cpp -march=native -lgsl -lgslcblas -o mpl\n')
for k in range(n_sim):
    pdata['xfile'] = 'example-'+str(k)
#     sim.py2c(**pdata)
    for i in range(len(ns_vals)):
        for j in range(len(dt_vals)):
            g.write('./mpl -d ../data/simulation ')
            g.write('-i jobs/sequences/nsdt/example-%d_ns%d_dt%d.dat '%(k,ns_vals[i],dt_vals[j]))
            g.write('-o jobs/output/nsdt/sc-%d_ns%d_dt%d.dat '%(k,ns_vals[i],dt_vals[j]))
            g.write('-g 1 -N 1e3 -mu 2e-4 -rr 2e-4 -e jobs/traitsite/traitsite-%d.dat '%(k))
            g.write('-es traitseq.dat -ed jobs/traitdis/traitdis-%d.dat\n'%(k))
g.close()


<a id='deal'></a>
### Dealing with the output
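Collect the inferred coefficients from all simulations and write them to mpl_collected_all.csv (binary inference over all n_s and Δt values) and mpl_collected_multiple.csv (multiple-allele inference).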

In [95]:
nB = pdata['n_ben']
nD = pdata['n_del']
nN = pdata['n_neu']
nP = 2
seq_length = nB+nD+nN

f = open('%s/mpl_collected_all.csv'%SIM_DIR,'w')
f.write('trajectory,t0,T,ns,delta_t')
for i in range(seq_length):
    f.write(',sc_%d'%i)
for i in range(nP):
    f.write(',pc_%d'%i)
f.write('\n')

for k in range(n_sim):
    for i in range(len(ns_vals)):
        for j in range(len(dt_vals)):
            sc = np.loadtxt('%s/jobs/output/nsdt/sc-%d_ns%d_dt%d.dat'%(SIM_DIR,k,ns_vals[i],dt_vals[j]))
            f.write('%d,0,1000,%d,%d'%(k,ns_vals[i],dt_vals[j]))
            for ii in range(seq_length):
                f.write(',%f'%sc[ii])
            for ii in range(nP):
                f.write(',%f'%sc[-nP+ii])
            f.write('\n')
f.close()

print('collect all coefficients for %d simulations'%n_sim)

collect all coefficients for 100 simulations


In [103]:
nB = pdata['n_ben']
nD = pdata['n_del']
nN = pdata['n_neu']
nP = 2
seq_length = nB+nD+nN

f = open('%s/mpl_collected_multiple.csv'%SIM_DIR,'w')
f.write('trajectory,t0,T,ns,delta_t')
for i in range(seq_length):
    f.write(',sc_%d'%i)
for i in range(nP):
    f.write(',pc_%d'%i)
f.write('\n')

for k in range(n_sim):
    sc = np.loadtxt('%s/jobs/output_multiple/sc-%d.dat'%(SIM_DIR,k))
    f.write('%d,0,1000,1000,1'%(k))
    for ii in range(seq_length):
        f.write(',%f'%(sc[2*ii+1]-sc[2*ii]))
    for ii in range(nP):
        f.write(',%f'%sc[-nP+ii])
    f.write('\n')
f.close()

print('collect all coefficients for %d simulations'%n_sim)

collect all coefficients for 100 simulations


Calculate the AUROC for classifying beneficial and deleterious mutations, and the normalized root-mean-square error (NRMSE) of the inferred trait coefficients. The results are written to mpl_collected_extended.csv.
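
The two metrics can be illustrated on a small toy example. The numbers below are invented for illustration and are not results from the simulations. The helper `auroc` is a hypothetical pairwise implementation (the fraction of positive/negative pairs ranked correctly, with ties counted as one half), which should agree with `roc_auc_score` from scikit-learn as used in the cell below. The NRMSE follows the same normalization as the `error_pol` column.

```python
import numpy as np

def auroc(labels, scores):
    """Pairwise AUROC: probability that a positive outranks a negative (ties count 1/2)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Toy data (hypothetical): 2 beneficial, 2 neutral, 2 deleterious sites
true_ben = [1, 1, 0, 0, 0, 0]
true_del = [0, 0, 0, 0, 1, 1]
inferred = np.array([0.03, 0.01, 0.015, -0.002, -0.018, -0.025])

auroc_ben = auroc(true_ben, inferred)    # beneficial sites should have the largest coefficients
auroc_del = auroc(true_del, -inferred)   # deleterious sites should have the smallest coefficients

# NRMSE of the trait coefficients, normalized by the true value fP
fP = 0.1
pc = np.array([0.12, 0.09])              # hypothetical inferred trait coefficients
nrmse = np.sqrt(np.sum((pc - fP) ** 2) / (len(pc) * fP ** 2))

print(auroc_ben, auroc_del, nrmse)
```

An AUROC of 1 means every site of the given class is ranked ahead of all other sites, and 0.5 corresponds to random ranking. An NRMSE of 0 means the trait coefficients are recovered exactly.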

In [98]:
nB = pdata['n_ben']
nD = pdata['n_del']
nN = pdata['n_neu']
nP = 2
fB = pdata['s_ben']
fD = pdata['s_del']
fP = pdata['s_pol']

true_ben = [1 if i in                        range(nB) else 0 for i in range(seq_length)];
true_del = [1 if i in range(seq_length-nD, seq_length) else 0 for i in range(seq_length)];
true_neu = [1 if i in range(        nB, seq_length-nD) else 0 for i in range(seq_length)];

coefs = ['sc_%d' % j for j in range(seq_length)]

df              = pd.read_csv('%s/mpl_collected_all.csv'%SIM_DIR, memory_map=True);

# difference between inferred coefficients and true coefficients
for i in range(seq_length):
    if   true_ben[i]: df['d_sc%d' % i] = df['sc_%d' % i] - fB;
    elif true_del[i]: df['d_sc%d' % i] = df['sc_%d' % i] - fD;
    elif true_neu[i]: df['d_sc%d' % i] = df['sc_%d' % i];
for i in range(nP):
    df['d_pc%d' % i] = df['pc_%d' % i] - fP;

# AUROC for beneficial and deleterious mutation
df['AUROC_ben'] = pd.Series(data=[roc_auc_score(true_ben, np.array(df.iloc[i][coefs])) for i in range(len(df))]);
df['AUROC_del'] = pd.Series(data=[roc_auc_score(true_del,-np.array(df.iloc[i][coefs])) for i in range(len(df))]);

# error for trait part
norm  = nP*(fP**2)
error = 0
for i in range(nP):
    error += (df['pc_%d' % i] - fP) ** 2
df['error_pol'] = np.sqrt(error/norm)


df.to_csv('%s/mpl_collected_extended.csv'%SIM_DIR)

print('collect all data for %d simulations'%n_sim)

collect all data for 100 simulations
