# <font color = red> Simulation Analysis </font>
This notebook records the parameters of the Wright-Fisher simulations used to generate our test data sets, as well as the commands for running inference algorithms on the test data and compiling the results.

## Contents
- ### [Libraries and variables](#lib)
- ### [Simple simulation](#sim)
    - #### [Wright-Fisher simulations without binary trait term](#simple)
    - #### [Try different $\gamma^{\prime}$](#simple-gamma)
    - #### [Consider all sites are time-varying](#simple-tv)
- ### [Simulation with trait](#sim2)
    - #### [Wright-Fisher simulations with binary trait terms](#trait)
    - #### [Finite sample data inference](#nsdt)

### <a id='lib'></a> Libraries and variables

In [1]:
print('This notebook was prepared using:')

import os
import sys
print('python version %s' % sys.version)

import numpy as np
print('numpy version %s' % np.__version__)

import pandas as pd
print('pandas version %s' % pd.__version__)

import math
from math import isnan

import matplotlib
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.image as mpimg
print('matplotlib version %s' % matplotlib.__version__)

import re
import argparse
import scipy as sp
import random

from scipy import integrate
import scipy.interpolate as sp_interpolate
import statistics

from dataclasses import dataclass
import time as time_module

import json
from importlib import reload

import simulation as sim
import importlib

# GitHub directories
HIV_DIR = 'data/HIV'
MPL_DIR = 'src/MPL'
SIM_DIR = 'data/simulation'
FIG_DIR = 'figures'

This notebook was prepared using:
python version 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:26:08) [Clang 14.0.6 ]
numpy version 1.26.4
pandas version 1.5.3
matplotlib version 3.7.1


<a id='sim'></a>
### Simple simulation

<a id='simple'></a>
#### Wright-Fisher simulations without binary trait term
The fitness model is:
$f_a = 1 + \sum_i^{\ell} s_i g_i^a$

This simulation begins with 4 random initial types. Each site has only 2 alleles (wild type and mutant type).

Wright-Fisher simulations are performed using simulation.py. The output of these simulations is saved for processing. 
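
As a rough illustration of what one Wright-Fisher generation involves, the sketch below applies selection and then multinomial resampling to a vector of genotype counts. It is not the code in simulation.py: the function name `wright_fisher_step` and the fitness values are hypothetical, and mutation and recombination are left out for brevity.

```python
import numpy as np

def wright_fisher_step(counts, fitness, rng):
    """Advance genotype counts by one generation (selection and drift only)."""
    pop_size = counts.sum()
    weights = counts * fitness               # selection: weight each genotype by its fitness
    probs = weights / weights.sum()          # normalize to sampling probabilities
    return rng.multinomial(pop_size, probs)  # drift: resample a population of fixed size

rng = np.random.default_rng(0)
counts = np.array([250, 250, 250, 250])       # 4 initial types, N = 1000
fitness = np.array([1.00, 1.02, 0.98, 1.04])  # hypothetical values of f_a
for _ in range(10):
    counts = wright_fisher_step(counts, fitness, rng)
```

The population size stays fixed at 1000 because each generation is drawn as a single multinomial sample.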

The sequence has 2 beneficial sites (s = 0.02, [0,1]), 2 neutral sites (s = 0, [2,3]), 2 deleterious sites (s = -0.02, [4,5]), and 4 time-varying sites ($s = 0.04 \sin(2\pi t/\tau)$ at [6,7] and $s = 0.04 \cos(2\pi t/\tau)$ at [8,9]).
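
The fitness model and the site classes listed above can be written out as a small sketch. The helper names `selection_coefficients` and `fitness` and the example genotype are hypothetical. The period $\tau$ is taken to be the number of generations, as in the parameter cell of this section, and $g_i^a$ is assumed to be 1 for the mutant allele and 0 for the wild type.

```python
import numpy as np

generations = 1000   # also used as the period tau, as in the parameter cell
seq_length = 10

def selection_coefficients(t):
    """Per-site selection coefficients s_i at generation t."""
    s = np.zeros(seq_length)                                # sites 2 and 3 stay neutral
    s[[0, 1]] = 0.02                                        # beneficial
    s[[4, 5]] = -0.02                                       # deleterious
    s[[6, 7]] = 0.04 * np.sin(2 * np.pi * t / generations)  # time-varying (sin)
    s[[8, 9]] = 0.04 * np.cos(2 * np.pi * t / generations)  # time-varying (cos)
    return s

def fitness(genotype, t):
    """f_a = 1 + sum_i s_i g_i^a, assuming g_i^a = 1 for mutant and 0 for wild type."""
    return 1.0 + float(selection_coefficients(t) @ np.asarray(genotype))

g = [1, 0, 0, 0, 1, 0, 1, 0, 0, 1]  # hypothetical genotype: mutant at sites 0, 4, 6, 9
f0 = fitness(g, 0)                  # at t = 0 the sin sites contribute 0, the cos sites 0.04
f250 = fitness(g, 250)              # at t = 250 the sin sites contribute 0.04, the cos sites ~0
```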

In [2]:
''' parameter '''
importlib.reload(sim)

generations = 1000
fi_1 = np.zeros(generations+1)
fi_2 = np.zeros(generations+1)

for t in range(len(fi_1)):
    fi_1[t] = np.sin(t*2*np.pi/generations)*0.04
    fi_2[t] = np.cos(t*2*np.pi/generations)*0.04
    
pdata = {  
    'NUC':           ['A', 'T'],      # all possible alleles
    'dir':           'simple',        # directory of this simulation
    'xfile':         '0',             # output file name
    'output_dir':    'output',        # directory of reference result
    'seq_length':    10,              # sequence length
    'pop_size':      1000,            # population size
    'generations':   generations,     # number of total generations
    'mut_rate':      1e-3,            # mutation rate
    'rec_rate':      1e-3,            # recombination rate
    'inital_state':  4,               # number of initial sub-population
    'bene':          [0,1],           # constant beneficial mutations sites
    'dele':          [4,5],           # constant deleterious mutations sites
    'p_1':           [6,7],           # time-varying mutations sites (sin)
    'p_2':           [8,9],           # time-varying mutations sites (cos)
    's_ben':         0.02,            # selection coefficient of beneficial mutations
    's_del':         -0.02,           # selection coefficient of deleterious mutations
    'fi_1':          fi_1,            # time-varying selection coefficient for individual site (sin)
    'fi_2':          fi_2,            # time-varying selection coefficient for individual site (cos)
    'gamma_s':       1,               # regularization - selection coefficients - constant part
    'gamma_2c':      100000,          # regularization - the time derivative of the selection coefficients
    'gamma_2tv':     200,             # regularization - the time derivative of the selection coefficients
    'theta':         0.5,             # magnification of extended time at the ends
    'beta':          4,               # magnification of extended gamma_2 at the ends
    }

In [3]:
''' Create simulated evolutionary trajectories'''

reload(sim)

n_sim   = 100

# simulation
for k in range(n_sim):
    pdata['xfile']        = str(k)
#     sim.simulate_simple(**pdata)

print('we have done %d simulations'%n_sim)

we have done 100 simulations


In [4]:
''' Infer selection coefficients'''

reload(sim)

for n in range(n_sim):
    pdata['xfile'] = str(n)
    # sim.infer_simple(**pdata)

print('we have done inference for %d simulations '%n_sim)

we have done inference for 100 simulations 


In [5]:
''' Collect simulation data'''

simple_dir = pdata['dir']
output_dir = pdata['output_dir']
beta       = pdata['beta']

const_num = 6
f = open('%s/%s/mpl_collected_%s.csv'%(SIM_DIR,simple_dir,beta),'w')
f.write('trajectory,ns,delta_t')
for i in range(const_num):
    f.write(',sc_%d'%i)
f.write('\n')

for k in range(n_sim):
    name = str(k)
    data_full   = np.load('%s/%s/%s/c_%s.npz'%(SIM_DIR,simple_dir,output_dir,name), allow_pickle=True)
    sc_full     = data_full['selection']
    TimeVaryingSC = [np.average(sc_full[i]) for i in range(const_num)]
    f.write('%d,1000,1'%k)
    for i in range(const_num):
        f.write(',%f'%TimeVaryingSC[i])
    f.write('\n')
f.close()

print('collect all coefficients for %d simulations'%(n_sim))

collect all coefficients for 100 simulations
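
A collected file of this kind can be summarized with pandas. The sketch below assumes the column layout written by the cell above (`trajectory,ns,delta_t,sc_0,...,sc_5`). The two data rows are made-up placeholders, not simulation results.

```python
import io
import pandas as pd

# Two hypothetical rows in the layout written by the collection cell.
csv_text = (
    "trajectory,ns,delta_t,sc_0,sc_1,sc_2,sc_3,sc_4,sc_5\n"
    "0,1000,1,0.021,0.018,0.001,-0.002,-0.019,-0.022\n"
    "1,1000,1,0.019,0.022,-0.001,0.002,-0.021,-0.018\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Mean and standard deviation of the time-averaged coefficient, one row per site.
sc_cols = [c for c in df.columns if c.startswith("sc_")]
summary = df[sc_cols].agg(["mean", "std"]).T
```

To use it on real output, replace `io.StringIO(csv_text)` with the path of the collected CSV file.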


<a id='simple-gamma'></a>
#### Try different $\gamma^{\prime}$
1. constant $\gamma^{\prime}$
2. small $\gamma^{\prime}$ at the boundary (middle value * 0.25)
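
One way to picture the two cases is a piecewise-constant $\gamma^{\prime}$ over a time axis that extends past the data at both ends. The sketch below is only an interpretation of the `theta` and `beta` comments in the parameter cell (extension length `theta` times the total time, end value `beta` times the middle value). The function `gamma_prime_profile` is hypothetical, and the profile actually used in simulation.py may differ.

```python
import numpy as np

def gamma_prime_profile(times, gamma_mid, beta, theta):
    """Piecewise-constant gamma' on a time axis extended beyond the data (assumed form)."""
    t0, t1 = times[0], times[-1]
    pad = int(round(theta * (t1 - t0)))           # length of the extension at each end
    ext_times = np.arange(t0 - pad, t1 + pad + 1)
    gamma = np.full(ext_times.shape, float(gamma_mid))
    outside = (ext_times < t0) | (ext_times > t1)
    gamma[outside] = beta * gamma_mid             # rescaled value in the extended regions
    return ext_times, gamma

times = np.arange(0, 1001)
ext_t, g_small = gamma_prime_profile(times, gamma_mid=200, beta=0.25, theta=0.5)  # case 2
_, g_const = gamma_prime_profile(times, gamma_mid=200, beta=1, theta=0.5)         # case 1
```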

In [6]:
''' Use different $\gamma^{\prime}$'''

reload(sim)

for n in range(n_sim):
    pdata['xfile'] = str(n)
    
    '''Smaller gamma^{prime} at the ends'''
    pdata['output_dir'] = 'output_0.25'
    pdata['beta'] = 0.25
    # sim.infer_simple(**pdata)

    '''Constant gamma^{prime} '''
    pdata['output_dir'] = 'output_1'
    pdata['beta'] = 1
    # sim.infer_simple(**pdata)

print('we have done inference for %d simulations in both cases'%n_sim)

we have done inference for 100 simulations in both cases


In [7]:
''' Collect simulation data'''

reload(sim)

betas = [0.25, 1]
const_num = 6

for beta in betas:
    # write the constant reference result
    f = open('%s/%s/mpl_collected_%s.csv'%(SIM_DIR,simple_dir,beta),'w')
    f.write('trajectory,ns,delta_t')
    for i in range(const_num):
        f.write(',sc_%d'%i)
    f.write('\n')
    
    for k in range(n_sim):
        name = str(k)
        data_full   = np.load('%s/%s/output_%s/c_%s.npz'%(SIM_DIR,simple_dir,beta,name), allow_pickle=True)
        sc_full     = data_full['selection']
        TimeVaryingSC = [np.average(sc_full[i]) for i in range(const_num)]
        f.write('%d,1000,1'%k)
        for i in range(const_num):
            f.write(',%f'%TimeVaryingSC[i])
        f.write('\n')
    f.close()

print('we have collected all constant coefficients for %d simulations for both cases'%n_sim)

we have collected all constant coefficients for 100 simulations for both cases


<a id='simple-tv'></a>
#### Consider all sites are time-varying
The fitness model is:
$f_a = 1 + \sum_i^{\ell} s_i g_i^a$

The model differs slightly from the previous section. Here we use 2 beneficial sites (*s = 0.03*, [0,1]), 2 neutral sites (s = 0, [2,3]), 2 deleterious sites (*s = -0.03*, [4,5]), and 4 time-varying sites ($s = 0.04 \sin(2\pi t/\tau)$ at [6,7] and $s = 0.04 \cos(2\pi t/\tau)$ at [8,9]).

In [8]:
''' Parameter '''

importlib.reload(sim)

generations = 1000
fi_1 = np.zeros(generations+1)
fi_2 = np.zeros(generations+1)

for t in range(len(fi_1)):
    fi_1[t] = np.sin(t*2*np.pi/generations)*0.04
    fi_2[t] = np.cos(t*2*np.pi/generations)*0.04
    
pdata = {  
    'NUC':           ['A', 'T'],      # all possible alleles
    'dir':           'simple-new',    # directory of this simulation
    'xfile':         '0',             # output file name
    'output_dir':    'output',        # directory of reference result
    'seq_length':    10,              # sequence length
    'pop_size':      1000,            # population size
    'generations':   generations,     # number of total generations
    'mut_rate':      1e-3,            # mutation rate
    'rec_rate':      1e-3,            # recombination rate
    'inital_state':  4,               # number of initial sub-population
    'bene':          [0,1],           # constant beneficial mutations sites
    'dele':          [4,5],           # constant deleterious mutations sites
    'p_1':           [6,7],           # time-varying mutations sites (sin)
    'p_2':           [8,9],           # time-varying mutations sites (cos)
    's_ben':         0.03,            # selection coefficient of beneficial mutations
    's_del':         -0.03,           # selection coefficient of deleterious mutations
    'fi_1':          fi_1,            # time-varying selection coefficient for individual site (sin)
    'fi_2':          fi_2,            # time-varying selection coefficient for individual site (cos)
    'gamma_s':       1,               # regularization - selection coefficients - constant part
    'gamma_2c':      100000,          # regularization - the time derivative of the selection coefficients
    'gamma_2tv':     200,             # regularization - the time derivative of the selection coefficients
    'theta':         0.5,             # magnification of extended time at the ends
    'beta':          4,               # magnification of extended gamma_2 at the ends
    }

# Change
# pdata['dir'] = 'simple-new'
# pdata['s_ben'] = 0.03
# pdata['s_del'] = -0.03

In [9]:
''' Create simulated evolutionary trajectories'''

reload(sim)

n_sim   = 100

# simulation
for k in range(n_sim):
    pdata['xfile']        = str(k)
    # sim.simulate_simple(**pdata)

print('we have done %d simulations'%n_sim)

we have done 100 simulations


In [10]:
''' Infer selection coefficients, treating all mutations as time-varying'''

reload(sim)

pdata['output_dir'] = 'output-tv'
pdata['p_1'] = [0, 1, 2, 3, 4]
pdata['p_2'] = [5, 6, 7, 8, 9]

for n in range(n_sim):
    pdata['xfile'] = str(n)
    # sim.infer_simple(**pdata)

print('Done inference for %d simulations (assuming all mutations are time-varying)'%n_sim)

Done inference for 100 simulations (assuming all mutations are time-varying)


<a id='sim2'></a>
### Simulation with trait

<a id='trait'></a>
#### Wright-Fisher simulations with binary trait terms
The fitness model is:
$f_a = 1 + \sum_i^{\ell} s_i g_i^a + \sum_n^{N_p} s_n g_n^a$

This simulation begins with 4 random initial types. Each site has only 2 alleles (wild type and mutant type).

Wright-Fisher simulations are performed using simulation.py. The output of these simulations is saved for processing. 

Beneficial sites: [0,1,2,3]; deleterious sites: [16,17,18,19]; trait sites: chosen randomly.
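
The extra trait term can be sketched as follows. The sketch assumes that a genotype carries binary trait $n$ ($g_n^a = 1$) when at least one of its trait sites differs from the wild-type (TF) allele. That rule is an assumption, not something stated in this notebook. The function `fitness_with_trait` is hypothetical, and the trait coefficient 0.06 is the value of `fn` at $t = 0$ in the parameter cell of this section.

```python
import numpy as np

def fitness_with_trait(genotype, s_site, trait_groups, trait_tf, s_trait):
    """f_a = 1 + sum_i s_i g_i^a + sum_n s_n g_n^a for one genotype."""
    g = np.asarray(genotype)
    f = 1.0 + float(np.dot(s_site, g))
    for sites, tf, s_n in zip(trait_groups, trait_tf, s_trait):
        # Assumed rule: the trait is present if any trait site differs from TF.
        has_trait = any(g[i] != w for i, w in zip(sites, tf))
        f += s_n * has_trait
    return f

seq_length = 20
s_site = np.zeros(seq_length)
s_site[[0, 1, 2, 3]] = 0.02        # beneficial sites
s_site[[16, 17, 18, 19]] = -0.02   # deleterious sites
trait_groups = [[12, 15, 17]]      # example escape group from the parameter cell
trait_tf = [[0, 0, 0]]             # wild-type alleles at the trait sites
s_trait = [0.06]                   # fn at t = 0

wild_type = np.zeros(seq_length, dtype=int)
escape = wild_type.copy()
escape[15] = 1                     # a single mutation at a trait site
f_wild = fitness_with_trait(wild_type, s_site, trait_groups, trait_tf, s_trait)
f_escape = fitness_with_trait(escape, s_site, trait_groups, trait_tf, s_trait)
```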

In [11]:
''' Parameter '''
importlib.reload(sim)

generations = 1000
fn = np.zeros(generations+1)
fi = np.zeros(generations+1)

for t in range(len(fn)):
    fi[t] = 0.04 - 0.08/generations * t
    fn[t] = 0.06 - 0.06/generations * t

trait_dir = 'trait'
pdata = {  
    'NUC':           ['A', 'T'],      # all possible alleles
    'dir':           trait_dir,       # directory of this simulation
    'xfile':         '0',             # output file name
    'seq_dir':       'sequences',     # directory of input simulation data
    'output_dir':    'output',        # directory of reference result
    'seq_length':    20,              # sequence length
    'pop_size':      1000,            # population size
    'generations':   generations,     # number of total generations
    'totalT':        generations,     # generations used to infer
    'mut_rate':      1e-3,            # mutation rate
    'rec_rate':      1e-3,            # recombination rate
    'inital_state':  4,               # number of initial sub-population
    'n_ben':         4,               # number of beneficial mutations
    'n_del':         4,               # number of deleterious mutations
    'bene':          [0,1,2,3],       # constant beneficial mutations sites
    'dele':          [16,17,18,19],   # constant deleterious mutations sites
    's_ben':         0.02,            # selection coefficient of beneficial mutations
    's_del':         -0.02,           # selection coefficient of deleterious mutations
    'fi':            fi,              # time-varying selection coefficient for individual site
    'fn':            fn,              # time-varying selection coefficient for binary trait
    'escape_group':  [[12,15,17]],    # escape sites
    'escape_TF':     [[0,0,0]],       # wild type sequences for escape sites
    'trait_dis':     [[3,2]],         # distance between trait sites
    'p_sites':       [9,10],          # special sites (not escape sites but still time-varying)
    'x_thresh':      0.005,           # threshold for single allele frequency
    'gamma_s':       1,               # regularization - selection coefficients - constant part
    'gamma_2c':      100000,          # regularization - the time derivative of the selection coefficients
    'gamma_2tv':     200,              # regularization - the time derivative of the selection coefficients
    'theta':         0.5,             # magnification of extended time at the ends
    'beta':          4,               # magnification of extended gamma_2 at the ends
    'bc_n':          True,            # True: Neumann boundary condition; False: Dirichlet boundary condition
    }

#### Use 3 files to store the information about trait groups ("traitsite": trait sites; "traitseq": TF sequences for trait sites; "traitdis": distances between neighboring trait sites)
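
The distance file follows directly from the trait sites. The sketch below computes the gaps between neighboring sites and formats each group as a tab-separated line, the format used by the commented-out file-writing code in the next cell. The helper names `neighbor_distances` and `to_lines` are hypothetical.

```python
def neighbor_distances(trait_groups):
    """Gaps between consecutive (sorted) trait sites in each group."""
    return [[b - a for a, b in zip(group[:-1], group[1:])] for group in trait_groups]

def to_lines(groups):
    """One tab-separated line per group."""
    return ['\t'.join(str(x) for x in group) for group in groups]

escape_group = [[12, 15, 17]]                  # example group from the parameter cell
trait_dis = neighbor_distances(escape_group)   # matches 'trait_dis': [[3, 2]]
site_lines = to_lines(escape_group)
dis_lines = to_lines(trait_dis)
```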

In [12]:
'Create the necessary files'
reload(sim)

n_sim   = 100

# # get random escape groups for 100 simulations
# escape_groups  = []
# special_groups = []
# for n in range(n_sim):
#     random_numbers   = random.sample(range(20), 3)
#     escape_group     = [sorted(random_numbers)]
#     escape_groups.append(escape_group)
    
#     # trait sites 
#     f = open('%s/%s/traitsite/traitsite-%s.dat'%(SIM_DIR,trait_dir,n), 'w')
#     for i in range(len(escape_group)):
#         f.write('%s\n'%'\t'.join([str(ii) for ii in escape_group[i]]))
#     f.close()
    
#     # distance between 2 trait sites
#     f = open('%s/%s/traitdis/traitdis-%s.dat'%(SIM_DIR,trait_dir,n), 'w')
#     for i in range(len(escape_group)):
#         i_dis = []
#         for j in range(len(escape_group[i])-1):
#             i_dis.append(int(escape_group[i][j+1]-escape_group[i][j]))
#         f.write('%s\n'%'\t'.join([str(ii) for ii in i_dis]))
#     f.close()
    

# # trait sequence 
# f = open('%s/%s/traitseq.dat'%(SIM_DIR,trait_dir), 'w')
# for i in range(1):
#     f.write('%s\n'%'\t'.join([str(0) for j in range(3)]))
# f.close()
    
# # save and load escape group information
# with open("%s/%s/escape_groups.dat"%(SIM_DIR,trait_dir), 'w') as file:
#     json.dump(escape_groups, file)

In [13]:
''' Create simulated evolutionary trajectories'''

importlib.reload(sim)

with open("%s/%s/escape_groups.dat"%(SIM_DIR,trait_dir), 'r') as file:
    escape_groups = json.load(file)
    
# simulation
for k in range(n_sim):
    pdata['xfile']        = str(k)
    pdata['escape_group'] = escape_groups[k]
#     sim.simulate_trait(**pdata)

print('we have done %d simulations'%n_sim)

we have done 100 simulations


In [14]:
''' Infer the results'''

importlib.reload(sim)

for n in range(n_sim):
    pdata['xfile']        = str(n)
    # sim.infer_trait(**pdata)

print('we have done inference for %d simulations '%n_sim)

we have done inference for 100 simulations 


In [15]:
'Create a csv file to store the results of all simulations'

nB = pdata['n_ben']
nD = pdata['n_del']
ne = len(pdata['escape_group'])

seq_length = pdata['seq_length']
p_sites    = [9,10]
out_dir    = ''
trait_dir = pdata['dir']

# Neumann boundary condition
f = open('%s/%s/mpl_collected%s.csv'%(SIM_DIR,trait_dir,out_dir),'w')
f.write('trajectory,ns,delta_t')
for i in range(seq_length):
    if i not in p_sites:
        f.write(',sc_%d'%i)
f.write('\n')

for k in range(n_sim):
    name = str(k)
    data_full   = np.load('%s/%s/output%s/c_%s.npz'%(SIM_DIR,trait_dir,out_dir,name), allow_pickle=True)
    sc_full     = data_full['selection']
    TimeVaryingSC = [np.average(sc_full[i]) for i in range(seq_length)]
    f.write('%d,1000,1'%k)
    for i in range(seq_length):
        if i not in p_sites:
            f.write(',%f'%TimeVaryingSC[i])
    f.write('\n')
f.close()

In [16]:
' Get inference results with different $\gamma^{\prime}$'

reload(sim)

for n in range(n_sim):
    pdata['xfile'] = str(n)
    
    '''Smaller gamma^{prime} at the ends'''
    pdata['output_dir'] = 'output_0.25'
    pdata['beta'] = 0.25
    # sim.infer_trait(**pdata)

    '''Constant gamma^{prime} '''
    pdata['output_dir'] = 'output_1'
    pdata['beta'] = 1
    # sim.infer_trait(**pdata)

print('we have done inference for %d simulations in both cases'%n_sim)

we have done inference for 100 simulations in both cases


#### <a id='nsdt'></a> Finite sample data inference

For each simulation, infer the results from data sampled at different time intervals $\Delta t$. In each case, try different $\gamma^{\prime}$.
1. dt = 10
2. dt = 20
3. dt = 100
4. random times, 11 time points: [0, 10, 25, 70, 135, 250, 400, 590, 720, 880, 1000]
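
The effect of each sampling scheme on the amount of data can be checked with a short sketch. It subsamples a stand-in trajectory array (random numbers, not simulation output) at regular and irregular time points. The helpers `observed_times` and `subsample` are hypothetical and are not the `sim.cut_seq` routine.

```python
import numpy as np

generations = 1000
rng = np.random.default_rng(1)
traj = rng.random((generations + 1, 20))   # stand-in for per-generation allele frequencies

def observed_times(dt, total=generations):
    """Regularly spaced sampling times 0, dt, 2*dt, ..., total."""
    return np.arange(0, total + 1, dt)

def subsample(trajectory, times):
    """Keep only the rows (generations) listed in times."""
    return trajectory[np.asarray(times, dtype=int)]

regular = {dt: subsample(traj, observed_times(dt)) for dt in (10, 20, 100)}
irregular_times = [0, 10, 25, 70, 135, 250, 400, 590, 720, 880, 1000]
irregular = subsample(traj, irregular_times)
n_points = {dt: sub.shape[0] for dt, sub in regular.items()}
```

With 1000 generations, the regular schemes keep 101, 51, and 11 time points, so the irregular scheme has the same number of observations as dt = 100 but spaces them unevenly.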

In [17]:
importlib.reload(sim)
'''cut sequence and then infer for different dt'''

dts = [10, 20, 1000]
for dt in dts:
    observed_time = np.linspace(0, 1000, int(1000/dt)+1)
    pdata['cut_dir'] = str(dt)
    pdata['cut_time'] = observed_time
    pdata['seq_dir'] = 'cut/'+str(dt)+'/sequences'
    output_dir = 'cut/'+str(dt)+'/output'
    
    for n in range(n_sim):
        pdata['xfile']        = str(n)
        # sim.cut_seq(**pdata)
    
    for n in range(n_sim):
        pdata['xfile']        = str(n)
        
        '''Standard gamma^{prime} at the ends'''
        pdata['output_dir'] = output_dir
        pdata['beta'] = 4
        # sim.infer_trait(**pdata)
    
        '''Smaller gamma^{prime} at the ends'''
        pdata['output_dir'] = output_dir + '_0.25'
        pdata['beta'] = 0.25
        # sim.infer_trait(**pdata)
    
        '''Constant gamma^{prime} '''
        pdata['output_dir'] = output_dir + '_1'
        pdata['beta'] = 1
        # sim.infer_trait(**pdata)

    print('Done inference for %d simulations with different γ\' when dt = %d'%(n_sim, dt))

Done inference for 100 simulations with different γ' when dt = 10
Done inference for 100 simulations with different γ' when dt = 20
Done inference for 100 simulations with different γ' when dt = 1000


In [18]:
importlib.reload(sim)
'''cut sequence and then infer'''

pdata['cut_dir']  = 'random'
pdata['cut_time'] = [0, 10, 25, 70, 135, 250, 400, 590, 720, 880, 1000]
pdata['seq_dir']  = 'cut/random/sequences'
output_dir        = 'cut/random/output'

for n in range(n_sim):
    pdata['xfile']        = str(n)
#     sim.cut_seq(**pdata)

for n in range(n_sim):
    pdata['xfile']        = str(n)
    
    '''Standard gamma^{prime} at the ends'''
    pdata['output_dir'] = output_dir
    pdata['beta'] = 4
    # sim.infer_trait(**pdata)

    '''Smaller gamma^{prime} at the ends'''
    pdata['output_dir'] = output_dir + '_0.25'
    pdata['beta'] = 0.25
    # sim.infer_trait(**pdata)

    '''Constant gamma^{prime} '''
    pdata['output_dir'] = output_dir + '_1'
    pdata['beta'] = 1
    # sim.infer_trait(**pdata)

print('Done inference for %d simulations with different γ\' for limited time points'%(n_sim))

Done inference for 100 simulations with different γ' for limited time points
