This notebook demonstrates how to supply custom effect sizes instead of using one of the distribution-based models provided in the library. In the univariate case, effect sizes may be passed as a list, a dictionary, or a pandas DataFrame. In the multivariate case, effect sizes must be passed as a pandas DataFrame that follows the format shown in the multivariate demos, so that it is compatible with the simulation library.

In [1]:
import numpy as np
import pandas as pd
import pygrgl
import random

from grg_pheno_sim.phenotype import sim_phenotypes_custom


The following cell only converts the gzipped VCF file into a GRG for use in the phenotype simulation. The bash script works as long as the relative path to the source data file is correct.

In [2]:
%%script bash --out /dev/null
if [ ! -f test-200-samples.grg ]; then
  grg construct -p 10 ../data/test-200-samples.vcf.gz --out-file test-200-samples.grg
fi

In [3]:
grg_1 = pygrgl.load_immutable_grg("test-200-samples.grg")  # load a sample GRG stored in the same directory
n = grg_1.num_mutations

In [4]:
random_effects = [random.random() for _ in range(n)]  # list input, uniform random values in [0, 1)

specific_effects = [1.0 for _ in range(n)]  # list input, fixed (non-random) values

effect_sizes = np.random.randn(n)  # standard normal draws, one per mutation

mutation_dict = {i: effect_sizes[i] for i in range(n)}  # dictionary input: mutation ID -> effect size

input_df = pd.DataFrame(list(mutation_dict.items()), columns=['mutation_id', 'effect_size'])  # DataFrame input

# DataFrame for building the simulation steps manually: same data plus a causal_mutation_id column
input_df_manual = input_df.copy()
input_df_manual['causal_mutation_id'] = 0
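
The three input forms carry the same information: one effect size per mutation ID. The following is a minimal sketch, in plain Python, of how a list and a dictionary could be reduced to the same `(mutation_id, effect_size)` rows. `to_effect_rows` is a hypothetical helper written for illustration and is not part of `grg_pheno_sim`; the assumption that a list is indexed by mutation ID is inferred from how `mutation_dict` is built above.

```python
from typing import Dict, List, Sequence, Tuple, Union

EffectInput = Union[Sequence[float], Dict[int, float]]


def to_effect_rows(effects: EffectInput) -> List[Tuple[int, float]]:
    """Return (mutation_id, effect_size) rows sorted by mutation ID.

    Hypothetical helper: assumes a list is indexed by mutation ID and a
    dictionary maps mutation ID -> effect size.
    """
    if isinstance(effects, dict):
        items = effects.items()
    else:
        items = enumerate(effects)
    return sorted((int(mid), float(beta)) for mid, beta in items)


rows_from_list = to_effect_rows([0.5, -1.0, 2.0])
rows_from_dict = to_effect_rows({2: 2.0, 0: 0.5, 1: -1.0})
print(rows_from_list == rows_from_dict)
```

Both calls yield the same rows, which is why the list and dictionary demos in this notebook differ only in the container that is passed.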


We first show custom effect sizes contained within a list.

In [5]:
normalize_genetic_values_before_noise = True

heritability = 0.33

standardized_output = True

output_path = 'custom_pheno.phen'  # path of the output .phen file, saved in the current directory

phenotypes_list = sim_phenotypes_custom(
    grg_1,
    specific_effects,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    heritability=heritability,
    standardized_output=standardized_output,
    path=output_path,
)
phenotypes_list

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0          1.0                   0
1                1          1.0                   0
2                2          1.0                   0
3                3          1.0                   0
4                4          1.0                   0
...            ...          ...                 ...
10888        10888          1.0                   0
10889        10889          1.0                   0
10890        10890          1.0                   0
10891        10891          1.0                   0
10892        10892          1.0                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0         2665.0                   0
1                1         2729.0                   0
2                2         2740.0                   0
3                3         2773.0                   0
4      

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-1.884264,2.677648,0.793385
1,0,1,-0.150362,1.162785,1.012423
2,0,2,0.147653,2.712520,2.860173
3,0,3,1.041696,1.706736,2.748431
4,0,4,-0.150362,-2.463950,-2.614312
...,...,...,...,...,...
195,0,195,1.339710,-2.350201,-1.010491
196,0,196,-1.396604,0.603103,-0.793500
197,0,197,-0.800575,0.398694,-0.401881
198,0,198,-0.692206,1.513843,0.821637
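
Under an additive model with every effect size set to 1.0, each genetic value is simply the number of mutation copies an individual carries, which is consistent with the whole-number genetic values printed above. The sketch below computes additive genetic values on a toy genotype matrix and then standardizes them. The matrix, `genetic_values`, and `standardize` are illustrative assumptions; the sketch does not reproduce how the library traverses the GRG, and the use of the population standard deviation for normalization is also an assumption.

```python
import statistics


def genetic_values(genotypes, effects):
    """Additive model: sum of (copies carried * effect size) per individual."""
    return [sum(g * b for g, b in zip(row, effects)) for row in genotypes]


def standardize(values):
    """Shift to mean 0 and scale to population standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Toy data: 4 individuals x 3 mutations; entries are copy counts (0, 1, or 2).
genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [0, 0, 1],
]
unit_effects = [1.0, 1.0, 1.0]

raw = genetic_values(genotypes, unit_effects)
normalized = standardize(raw)
print(raw)
print(normalized)
```

Individuals with the same raw value receive the same normalized value, as with individuals 1 and 4 in the output above.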


We then show custom effect sizes contained within a dictionary.

In [6]:
normalize_genetic_values_before_noise = True

heritability = 0.33

# by default, the standard .phen output is not saved

phenotypes_dict = sim_phenotypes_custom(
    grg_1,
    mutation_dict,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    heritability=heritability,
)
phenotypes_dict

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0    -0.641312                   0
1                1    -0.889631                   0
2                2     0.732203                   0
3                3     1.241809                   0
4                4     0.174566                   0
...            ...          ...                 ...
10888        10888    -0.608520                   0
10889        10889    -0.340360                   0
10890        10890     0.446402                   0
10891        10891     0.043544                   0
10892        10892    -0.029246                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0      36.294256                   0
1                1      37.397362                   0
2                2      59.498796                   0
3                3      58.527239                   0
4      

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-0.780890,0.080969,-0.699922
1,0,1,-0.751310,-0.574883,-1.326194
2,0,2,-0.158653,-0.276773,-0.435426
3,0,3,-0.184706,0.067794,-0.116911
4,0,4,-0.115725,2.222999,2.107274
...,...,...,...,...,...
195,0,195,-1.568149,1.127886,-0.440263
196,0,196,0.691328,-1.230243,-0.538915
197,0,197,-0.604501,-0.225780,-0.830280
198,0,198,0.676833,3.721745,4.398578
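
The `heritability` argument fixes the share of phenotypic variance that is genetic: h2 = Var(G) / (Var(G) + Var(E)). Solving for the noise variance gives Var(E) = Var(G) * (1 - h2) / h2. The sketch below applies that textbook relation to simulated standardized genetic values. It assumes independent Gaussian noise and is not taken from the library's source; `noise_variance` is a hypothetical helper.

```python
import random
import statistics


def noise_variance(genetic_variance, h2):
    """Noise variance that yields heritability h2 for the given genetic variance."""
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must be in (0, 1]")
    return genetic_variance * (1.0 - h2) / h2


rng = random.Random(0)
h2 = 0.33

# Stand-in for normalized genetic values (mean 0, variance close to 1).
genetic = [rng.gauss(0.0, 1.0) for _ in range(20000)]
var_g = statistics.pvariance(genetic)

# Draw independent Gaussian noise with the variance implied by h2.
noise_sd = noise_variance(var_g, h2) ** 0.5
noise = [rng.gauss(0.0, noise_sd) for _ in genetic]
phenotype = [g + e for g, e in zip(genetic, noise)]

# The realized heritability should land close to the target.
realized_h2 = var_g / statistics.pvariance(phenotype)
print(noise_variance(1.0, h2))
print(realized_h2)
```

For genetic values with unit variance and h2 = 0.33, the implied noise standard deviation is about 1.42, which is in line with the spread of the `environmental_noise` column in the outputs above.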


Finally, we show custom effect sizes contained within a pandas DataFrame. The user need not add the `causal_mutation_id` column; it is added internally.

In [7]:
normalize_genetic_values_before_noise = True

heritability = 0.33

phenotypes_df = sim_phenotypes_custom(
    grg_1,
    input_df,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    heritability=heritability,
)
phenotypes_df

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0    -0.641312                   0
1                1    -0.889631                   0
2                2     0.732203                   0
3                3     1.241809                   0
4                4     0.174566                   0
...            ...          ...                 ...
10888        10888    -0.608520                   0
10889        10889    -0.340360                   0
10890        10890     0.446402                   0
10891        10891     0.043544                   0
10892        10892    -0.029246                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0      36.294256                   0
1                1      37.397362                   0
2                2      59.498796                   0
3                3      58.527239                   0
4      

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-0.780890,-1.686755,-2.467646
1,0,1,-0.751310,-0.058329,-0.809639
2,0,2,-0.158653,1.045812,0.887159
3,0,3,-0.184706,0.424054,0.239348
4,0,4,-0.115725,2.216632,2.100907
...,...,...,...,...,...
195,0,195,-1.568149,-0.829692,-2.397841
196,0,196,0.691328,-0.450031,0.241297
197,0,197,-0.604501,-4.004007,-4.608508
198,0,198,0.676833,-0.113006,0.563828


Alternatively, users can take their custom effect sizes (in a compatible DataFrame) and build the consecutive steps of the simulation manually instead of calling `sim_phenotypes_custom`. In the univariate case, the DataFrame must have the same form as `input_df_manual` above.
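
Before building the steps manually, it can help to check that the DataFrame has the expected shape. The following is a minimal sketch of such a check. `check_effect_size_df` and `REQUIRED_COLUMNS` are hypothetical names, not part of `grg_pheno_sim`, and the rules it enforces (unique mutation IDs in the range 0 to `num_mutations - 1`) are assumptions based on the printed effect-size tables in this notebook.

```python
import pandas as pd

# Column names taken from the effect-size tables printed in this notebook.
REQUIRED_COLUMNS = ["mutation_id", "effect_size", "causal_mutation_id"]


def check_effect_size_df(df, num_mutations):
    """Raise ValueError if df does not look like a univariate effect-size table."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df["mutation_id"].duplicated().any():
        raise ValueError("mutation_id values must be unique")
    # Assumption: valid IDs are 0 .. num_mutations - 1.
    if not df["mutation_id"].between(0, num_mutations - 1).all():
        raise ValueError("mutation_id out of range")
    return df


toy = pd.DataFrame({"mutation_id": [0, 1, 2], "effect_size": [0.1, -0.2, 0.3]})
toy["causal_mutation_id"] = 0  # single trait, so every row gets ID 0
checked = check_effect_size_df(toy, num_mutations=3)
print(checked)
```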

Now, we show how to simulate phenotypes from custom effect sizes with user-defined noise.

In [8]:
normalize_genetic_values_before_noise = True

mean_1 = 0  # mean of the user-defined noise
std_1 = 1  # spread of the user-defined noise, passed as user_cov below

phenotypes_df_user_noise = sim_phenotypes_custom(
    grg_1,
    input_df,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    user_mean=mean_1,
    user_cov=std_1,
)
phenotypes_df_user_noise

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0    -0.641312                   0
1                1    -0.889631                   0
2                2     0.732203                   0
3                3     1.241809                   0
4                4     0.174566                   0
...            ...          ...                 ...
10888        10888    -0.608520                   0
10889        10889    -0.340360                   0
10890        10890     0.446402                   0
10891        10891     0.043544                   0
10892        10892    -0.029246                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0      36.294256                   0
1                1      37.397362                   0
2                2      59.498796                   0
3                3      58.527239                   0
4      

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-0.780890,-0.683204,-1.464095
1,0,1,-0.751310,-0.280735,-1.032045
2,0,2,-0.158653,1.013221,0.854568
3,0,3,-0.184706,0.527970,0.343265
4,0,4,-0.115725,-2.171292,-2.287017
...,...,...,...,...,...
195,0,195,-1.568149,0.507472,-1.060677
196,0,196,0.691328,-1.435071,-0.743743
197,0,197,-0.604501,-1.227537,-1.832038
198,0,198,0.676833,-2.255291,-1.578458
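
With user-defined noise, the environmental term is drawn from a normal distribution with the supplied parameters rather than derived from a heritability. The sketch below shows that idea in plain Python. `add_user_noise` is a hypothetical helper, and treating the second parameter as a variance (so that the standard deviation is its square root) is an assumption suggested by the argument name `user_cov`, not something this notebook confirms. With a value of 1, the variance and the standard deviation coincide.

```python
import math
import random


def add_user_noise(genetic_values, mean, variance, seed=None):
    """Add Gaussian noise with the given mean and variance to each genetic value."""
    rng = random.Random(seed)
    sd = math.sqrt(variance)  # assumption: the second parameter is a variance
    rows = []
    for g in genetic_values:
        e = rng.gauss(mean, sd)
        rows.append(
            {"genetic_value": g, "environmental_noise": e, "phenotype": g + e}
        )
    return rows


# Toy genetic values, rounded from the first rows of the output above.
rows = add_user_noise([-0.78, -0.75, -0.16], mean=0.0, variance=1.0, seed=1)
for row in rows:
    print(row)
```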
