In [1]:
import numpy as np
import pandas as pd
import pygrgl
import random

import sys
sys.path.append('/Users/adityasyam/grg_pheno_sim') 

from grg_pheno_sim.phenotype import sim_phenotypes_custom


This notebook demonstrates how to supply custom effect sizes instead of drawing them from one of the distribution-based models provided in the library. In the univariate case, effect sizes may be passed as a list, a dictionary, or a pandas dataframe. In the multivariate case, they must be passed as a pandas dataframe that follows the format shown in the multivariate demos, so that it is compatible with the simulation library.

The following command converts the gzipped VCF file into a GRG, which is then used for the phenotype simulation.

In [2]:
%%script bash --out /dev/null
echo "Test"
grg construct --no-maf-flip -p 10 -t 2 ../data/test-200-samples.vcf.gz --out-file test-200-samples.grg

Construction took 42 ms
Wrote GRG to test-200-samples.vcf.gz.part0.tree0.grg in 0 ms
Construction took 54 ms
Wrote GRG to test-200-samples.vcf.gz.part0.tree1.grg in 0 ms
Construction took 0 ms
Mapping mutations took262 ms
Wrote GRG to test-200-samples.vcf.gz.part0.grg in 6 ms
Construction took 45 ms
Wrote GRG to test-200-samples.vcf.gz.part1.tree0.grg in 0 ms
Construction took 49 ms
Wrote GRG to test-200-samples.vcf.gz.part1.tree1.grg in 0 ms
Construction took 0 ms
Mapping mutations took324 ms
Wrote GRG to test-200-samples.vcf.gz.part1.grg in 3 ms
Construction took 52 ms
Wrote GRG to test-200-samples.vcf.gz.part2.tree0.grg in 0 ms
Construction took 47 ms
Wrote GRG to test-200-samples.vcf.gz.part2.tree1.grg in 0 ms
Construction took 0 ms
Mapping mutations took438 ms
Wrote GRG to test-200-samples.vcf.gz.part2.grg in 4 ms
Construction took 59 ms
Wrote GRG to test-200-samples.vcf.gz.part3.tree0.grg in 0 ms
Construction took 54 ms
Wrote GRG to test-200-samples.vcf.gz.part3.tree1.grg in 0 ms

In [3]:
grg_1 = pygrgl.load_immutable_grg("test-200-samples.grg") #loading in a sample grg stored in the same directory
n = grg_1.num_mutations

In [4]:
random_effects = [random.random() for _ in range(n)]  # list input, random values in [0, 1)

specific_effects = [1.0 for _ in range(n)]  # list input, fixed (non-random) values

effect_sizes = np.random.randn(n)  # standard normal draws, one per mutation

mutation_dict = {i: effect_sizes[i] for i in range(n)}  # dictionary input: mutation ID -> effect size

input_df = pd.DataFrame(list(mutation_dict.items()), columns=['mutation_id', 'effect_size'])  # dataframe input

# dataframe input for building the simulation steps manually: it also carries the causal_mutation_id column
input_df_manual = pd.DataFrame(list(mutation_dict.items()), columns=['mutation_id', 'effect_size'])
input_df_manual['causal_mutation_id'] = 0
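
The three input forms built above carry the same information: one effect size per mutation ID. The sketch below illustrates how they could be coerced into the three-column layout that the simulation prints (`mutation_id`, `effect_size`, `causal_mutation_id`). The helper `to_effect_frame` is hypothetical and is not part of `grg_pheno_sim`; it assumes that position `i` in a list refers to mutation ID `i`, which matches the output shown for list input in this notebook.

```python
import pandas as pd


def to_effect_frame(effects):
    """Hypothetical helper: coerce a list, dict, or dataframe of effect sizes
    into one layout. Not the library's own code."""
    if isinstance(effects, pd.DataFrame):
        # Keep only the two columns the user is expected to supply.
        frame = effects[["mutation_id", "effect_size"]].copy()
    elif isinstance(effects, dict):
        # Keys are mutation IDs, values are effect sizes.
        frame = pd.DataFrame(
            {
                "mutation_id": list(effects.keys()),
                "effect_size": list(effects.values()),
            }
        )
    else:
        # Assumption: list position i corresponds to mutation ID i.
        values = list(effects)
        frame = pd.DataFrame(
            {"mutation_id": range(len(values)), "effect_size": values}
        )
    # Univariate case: a single causal mutation set, labelled 0.
    frame["causal_mutation_id"] = 0
    return frame.reset_index(drop=True)


from_list = to_effect_frame([0.5, -1.0, 2.0])
from_dict = to_effect_frame({0: 0.5, 1: -1.0, 2: 2.0})
from_df = to_effect_frame(
    pd.DataFrame({"mutation_id": [0, 1, 2], "effect_size": [0.5, -1.0, 2.0]})
)
```

All three calls produce frames with identical contents, which is why the dictionary and dataframe runs later in the notebook print the same initial effect sizes.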


We first show custom effect sizes contained within a list.

In [5]:
normalize_genetic_values_before_noise = True

noise_heritability = 0.33

standardized_output = True

output_path = 'custom_pheno.phen'  # output path: the .phen file is written under this name in the current directory

phenotypes_list = sim_phenotypes_custom(grg_1, specific_effects, normalize_genetic_values_before_noise=normalize_genetic_values_before_noise, noise_heritability=noise_heritability, standardized_output=standardized_output, path=output_path)
phenotypes_list

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0          1.0                   0
1                1          1.0                   0
2                2          1.0                   0
3                3          1.0                   0
4                4          1.0                   0
...            ...          ...                 ...
10888        10888          1.0                   0
10889        10889          1.0                   0
10890        10890          1.0                   0
10891        10891          1.0                   0
10892        10892          1.0                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0         2665.0                   0
1                1         2729.0                   0
2                2         2740.0                   0
3                3         2773.0                   0
4      

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-1.139590,0.759913,-0.379677
1,0,1,-0.090938,-0.563209,-0.654147
2,0,2,0.089299,-0.738139,-0.648839
3,0,3,0.630011,1.068854,1.698864
4,0,4,-0.090938,-0.432871,-0.523808
...,...,...,...,...,...
195,0,195,0.810248,-1.492313,-0.682066
196,0,196,-0.844657,-0.361731,-1.206388
197,0,197,-0.484182,-0.420650,-0.904832
198,0,198,-0.418642,0.253474,-0.165167


We then show custom effect sizes contained within a dictionary.

In [6]:
normalize_genetic_values_before_noise = True

noise_heritability = 0.33

#by default, the standard .phen output will not be saved

phenotypes_dict = sim_phenotypes_custom(grg_1, mutation_dict, normalize_genetic_values_before_noise=normalize_genetic_values_before_noise, noise_heritability=noise_heritability)
phenotypes_dict

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0     0.006846                   0
1                1    -1.449547                   0
2                2    -0.301762                   0
3                3    -0.436168                   0
4                4     0.557897                   0
...            ...          ...                 ...
10888        10888     0.471821                   0
10889        10889    -0.687209                   0
10890        10890     0.521598                   0
10891        10891    -1.104360                   0
10892        10892     0.951039                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0     -44.494730                   0
1                1      -6.226356                   0
2                2    -139.030589                   0
3                3      32.921779                   0
4      

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-0.165696,-0.217767,-0.383464
1,0,1,0.403616,0.352432,0.756048
2,0,2,-1.572090,0.008561,-1.563529
3,0,3,0.986016,-1.151668,-0.165652
4,0,4,0.318533,-1.065328,-0.746795
...,...,...,...,...,...
195,0,195,0.184091,-0.494017,-0.309926
196,0,196,-0.309442,-0.094950,-0.404392
197,0,197,0.179710,0.816509,0.996219
198,0,198,-0.217245,0.146144,-0.071100


Finally, we show custom effect sizes contained within a pandas dataframe. The user need not add the `causal_mutation_id` column; that is handled internally.

In [7]:
normalize_genetic_values_before_noise = True

noise_heritability = 0.33

phenotypes_df = sim_phenotypes_custom(grg_1, input_df, normalize_genetic_values_before_noise=normalize_genetic_values_before_noise, noise_heritability=noise_heritability)
phenotypes_df

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0     0.006846                   0
1                1    -1.449547                   0
2                2    -0.301762                   0
3                3    -0.436168                   0
4                4     0.557897                   0
...            ...          ...                 ...
10888        10888     0.471821                   0
10889        10889    -0.687209                   0
10890        10890     0.521598                   0
10891        10891    -1.104360                   0
10892        10892     0.951039                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0     -44.494730                   0
1                1      -6.226356                   0
2                2    -139.030589                   0
3                3      32.921779                   0
4      

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-0.163151,-0.786583,-0.949734
1,0,1,0.397416,0.333175,0.730591
2,0,2,-1.547940,1.384830,-0.163111
3,0,3,0.970869,0.293755,1.264625
4,0,4,0.313640,-1.066935,-0.753295
...,...,...,...,...,...
195,0,195,0.181263,-1.560362,-1.379099
196,0,196,-0.304688,0.775898,0.471210
197,0,197,0.176949,0.050320,0.227269
198,0,198,-0.213907,1.275329,1.061422


Alternatively, the user can supply their custom effect sizes in a compatible dataframe and build the consecutive steps of the simulation manually instead of calling the `sim_phenotypes_custom` function. For the univariate case, the dataframe must be formed as shown for `input_df_manual` above.
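
To make the consecutive steps concrete, here is a minimal NumPy sketch of one common way to go from effect sizes to phenotypes. It is an illustration under stated assumptions, not the library's implementation: the small `genotypes` matrix is a made-up stand-in for the GRG, and the noise variance follows the usual convention that heritability is Var(g) / (Var(g) + Var(e)).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in for the GRG: 6 individuals x 4 mutations,
# each entry is the number of copies of the mutation (0, 1, or 2).
genotypes = np.array(
    [
        [0, 1, 2, 0],
        [1, 0, 0, 2],
        [2, 2, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 0, 1],
        [2, 0, 2, 2],
    ]
)
effect_sizes = np.array([0.5, -1.0, 2.0, 0.25])

# Step 1: genetic value = sum of effect sizes over the mutations carried.
genetic_values = genotypes @ effect_sizes

# Step 2: normalize genetic values to mean 0 and variance 1 before adding noise.
normalized = (genetic_values - genetic_values.mean()) / genetic_values.std()

# Step 3: choose the noise variance so that Var(g) / (Var(g) + Var(e)) = h2.
h2 = 0.33
noise_sd = np.sqrt(normalized.var() * (1 - h2) / h2)
noise = rng.normal(0.0, noise_sd, size=normalized.shape)

# Step 4: phenotype = genetic value + environmental noise.
phenotypes = normalized + noise
```

With a real GRG, step 1 would be computed on the graph rather than on a dense matrix; the remaining steps operate on one value per individual and do not depend on how the genetic values were obtained.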

Now, we show how the user can simulate custom phenotypes using custom noise, by passing `user_mean` and `user_cov` in place of `noise_heritability`.

In [8]:
normalize_genetic_values_before_noise = True

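mean_1 = 0  # mean of the user-defined noise
std_1 = 1  # passed as user_cov below; with a value of 1 the standard deviation and variance coincide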

phenotypes_df_user_noise = sim_phenotypes_custom(grg_1, input_df, normalize_genetic_values_before_noise=normalize_genetic_values_before_noise, user_mean=mean_1, user_cov=std_1)
phenotypes_df_user_noise

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0     0.006846                   0
1                1    -1.449547                   0
2                2    -0.301762                   0
3                3    -0.436168                   0
4                4     0.557897                   0
...            ...          ...                 ...
10888        10888     0.471821                   0
10889        10889    -0.687209                   0
10890        10890     0.521598                   0
10891        10891    -1.104360                   0
10892        10892     0.951039                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0     -44.494730                   0
1                1      -6.226356                   0
2                2    -139.030589                   0
3                3      32.921779                   0
4      

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-0.205281,-0.438482,-0.643763
1,0,1,0.500041,-0.326659,0.173382
2,0,2,-1.947667,-0.326368,-2.274035
3,0,3,1.221578,0.441272,1.662849
4,0,4,0.394632,0.040602,0.435234
...,...,...,...,...,...
195,0,195,0.228071,-0.420339,-0.192268
196,0,196,-0.383368,-0.483245,-0.866614
197,0,197,0.222643,-0.106128,0.116515
198,0,198,-0.269145,-0.635459,-0.904604
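
As a quick sanity check on the table above, the sketch below confirms that the `phenotype` column is the sum of `genetic_value` and `environmental_noise`. The numbers are copied from the first five rows of the printed output; the tolerance only absorbs the six-decimal rounding of the display.

```python
import numpy as np

# First five rows of the custom-noise output shown above.
genetic_value = np.array([-0.205281, 0.500041, -1.947667, 1.221578, 0.394632])
environmental_noise = np.array([-0.438482, -0.326659, -0.326368, 0.441272, 0.040602])
phenotype = np.array([-0.643763, 0.173382, -2.274035, 1.662849, 0.435234])

# Phenotype should equal genetic value plus noise, up to display rounding.
additive = np.allclose(
    genetic_value + environmental_noise, phenotype, rtol=0.0, atol=5e-6
)
```

The same relationship holds for the rows printed in the earlier cells of this notebook.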
