In [1]:
import numpy as np
import pandas as pd
import pygrgl
import random

from grg_pheno_sim.phenotype import sim_phenotypes_custom


This notebook demonstrates how to supply custom effect sizes instead of using one of the distribution-based models provided in the library. In the univariate case, effect sizes can be passed as a list, a dictionary, or a pandas DataFrame. In the multivariate case, they must be passed as a pandas DataFrame that follows the format shown in the multivariate demos so that it is compatible with the simulation library.

The following command only converts the gzipped VCF file into a GRG that will be used for the phenotype simulation. It is skipped if the GRG file already exists.

In [2]:
%%script bash --out /dev/null
if [ ! -f test-200-samples.grg ]; then
  grg construct -p 10 -t 2 ../data/test-200-samples.vcf.gz --out-file test-200-samples.grg
fi

In [3]:
grg_1 = pygrgl.load_immutable_grg("test-200-samples.grg") #load the sample GRG stored in the same directory
n = grg_1.num_mutations

In [4]:
random_effects = [random.random() for _ in range(n)] #list input, uniform random values in [0, 1)

specific_effects = [1.0 for _ in range(n)] #list input, fixed (non-random) values

effect_sizes = np.random.randn(n) #standard normal draws, one per mutation

mutation_dict = {i: effect_sizes[i] for i in range(n)} #dictionary input, mutation id -> effect size

input_df = pd.DataFrame(list(mutation_dict.items()), columns=['mutation_id', 'effect_size']) #dataframe input

input_df_manual = pd.DataFrame(list(mutation_dict.items()), columns=['mutation_id', 'effect_size']) #dataframe input for building the simulation steps manually
input_df_manual['causal_mutation_id'] = 0 #univariate case: causal_mutation_id is 0 for every row


We first show custom effect sizes passed as a list.

In [5]:
normalize_genetic_values_before_noise = True

heritability = 0.33

standardized_output = True #also save the output in the standard .phen format

output_path = 'custom_pheno.phen' #path of the output file; it is saved under this name in the current directory

phenotypes_list = sim_phenotypes_custom(grg_1, specific_effects, normalize_genetic_values_before_noise=normalize_genetic_values_before_noise, heritability=heritability, standardized_output=standardized_output, path=output_path)
phenotypes_list

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0          1.0                   0
1                1          1.0                   0
2                2          1.0                   0
3                3          1.0                   0
4                4          1.0                   0
...            ...          ...                 ...
10888        10888          1.0                   0
10889        10889          1.0                   0
10890        10890          1.0                   0
10891        10891          1.0                   0
10892        10892          1.0                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0         2665.0                   0
1                1         2729.0                   0
2                2         2740.0                   0
3                3         2773.0                   0
4      

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-1.884264,0.950101,-0.934163
1,0,1,-0.150362,0.187411,0.037050
2,0,2,0.147653,-3.207314,-3.059661
3,0,3,1.041696,0.614566,1.656261
4,0,4,-0.150362,2.481596,2.331234
...,...,...,...,...,...
195,0,195,1.339710,-1.724298,-0.384587
196,0,196,-1.396604,-0.578220,-1.974824
197,0,197,-0.800575,-0.133496,-0.934071
198,0,198,-0.692206,1.464709,0.772503
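
With every effect size set to 1.0, the raw genetic values printed above (2665.0, 2729.0, ...) are whole numbers. That is what one would expect if a genetic value is the sum of effect sizes over the mutation copies an individual carries. The following is a minimal sketch of that additive model using a dense toy genotype matrix; the matrix and the variable names are hypothetical, the library itself works on the GRG rather than a dense matrix, and z-scoring with the population standard deviation is an assumption about what normalization does.

```python
import numpy as np

# Hypothetical toy genotype matrix: rows are individuals, columns are mutations,
# entries are the number of copies of each mutation an individual carries.
genotypes = np.array([
    [0, 1, 2, 0],
    [1, 1, 0, 0],
    [2, 2, 1, 1],
    [0, 0, 0, 1],
    [1, 2, 2, 2],
])

# Effect size of 1.0 for every mutation, as in specific_effects.
unit_effects = np.ones(genotypes.shape[1])

# Additive genetic value: sum of effect sizes over carried mutation copies.
raw_genetic_values = genotypes @ unit_effects

# Assumed normalization step: z-score the genetic values across individuals.
normalized_genetic_values = (
    raw_genetic_values - raw_genetic_values.mean()
) / raw_genetic_values.std()

print(raw_genetic_values)
print(normalized_genetic_values)
```

With unit effects, each raw genetic value is simply the individual's total mutation count, which is why the values in the demo output are integers before normalization.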


We then show custom effect sizes passed as a dictionary.

In [6]:
normalize_genetic_values_before_noise = True

heritability = 0.33

#by default, the standard .phen output will not be saved

phenotypes_dict = sim_phenotypes_custom(grg_1, mutation_dict, normalize_genetic_values_before_noise=normalize_genetic_values_before_noise, heritability=heritability)
phenotypes_dict

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0     0.174430                   0
1                1    -0.493785                   0
2                2    -0.108830                   0
3                3     0.271743                   0
4                4     0.081619                   0
...            ...          ...                 ...
10888        10888    -0.507485                   0
10889        10889    -1.161496                   0
10890        10890    -0.596188                   0
10891        10891    -0.386963                   0
10892        10892     0.970477                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0     -38.280037                   0
1                1     -42.299913                   0
2                2      24.824838                   0
3                3       3.064312                   0
4      

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-0.273825,2.185231,1.911405
1,0,1,-0.378820,0.649656,0.270836
2,0,2,1.374414,-0.597548,0.776865
3,0,3,0.806050,-1.668057,-0.862008
4,0,4,-0.782916,0.773616,-0.009300
...,...,...,...,...,...
195,0,195,1.002198,0.977937,1.980135
196,0,196,-1.090890,-1.505821,-2.596711
197,0,197,-1.188372,-0.546668,-1.735040
198,0,198,-0.513046,3.260040,2.746994
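
The calls above pass heritability=0.33 together with normalize_genetic_values_before_noise=True. One common way to reach a target heritability h2 under an additive model is to draw environmental noise with variance Var(G) * (1 - h2) / h2, so that Var(G) / (Var(G) + Var(E)) = h2. The following is a minimal sketch of that idea in plain NumPy, not the library's implementation; the helper name `add_noise_for_heritability` and the toy genetic values are hypothetical.

```python
import numpy as np

def add_noise_for_heritability(genetic_values, heritability, rng):
    """Hypothetical helper: normalize genetic values, then add Gaussian noise
    whose variance is chosen to match the target heritability."""
    g = np.asarray(genetic_values, dtype=float)
    g = (g - g.mean()) / g.std()  # normalize before adding noise
    noise_var = g.var() * (1.0 - heritability) / heritability
    noise = rng.normal(0.0, np.sqrt(noise_var), size=g.shape)
    return g, noise, g + noise

rng = np.random.default_rng(0)
raw = rng.normal(10.0, 50.0, size=200_000)  # toy raw genetic values
g, e, p = add_noise_for_heritability(raw, 0.33, rng)

# Empirical heritability: share of phenotypic variance explained by genetics.
h2_hat = g.var() / p.var()
print(h2_hat)
```

Under this assumption, each phenotype is the genetic value plus the environmental noise, which matches the rows of the output tables (for example, -0.273825 + 2.185231 = 1.911405, up to rounding of the displayed values).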


Finally, we show custom effect sizes passed as a pandas DataFrame. The user need not add the causal_mutation_id column; that is handled internally.

In [7]:
normalize_genetic_values_before_noise = True

heritability = 0.33

phenotypes_df = sim_phenotypes_custom(grg_1, input_df, normalize_genetic_values_before_noise=normalize_genetic_values_before_noise, heritability=heritability)
phenotypes_df

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0     0.174430                   0
1                1    -0.493785                   0
2                2    -0.108830                   0
3                3     0.271743                   0
4                4     0.081619                   0
...            ...          ...                 ...
10888        10888    -0.507485                   0
10889        10889    -1.161496                   0
10890        10890    -0.596188                   0
10891        10891    -0.386963                   0
10892        10892     0.970477                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0     -38.280037                   0
1                1     -42.299913                   0
2                2      24.824838                   0
3                3       3.064312                   0
4      

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-0.273825,0.884034,0.610209
1,0,1,-0.378820,0.067479,-0.311341
2,0,2,1.374414,-1.347188,0.027226
3,0,3,0.806050,0.037719,0.843769
4,0,4,-0.782916,-0.709292,-1.492207
...,...,...,...,...,...
195,0,195,1.002198,1.168733,2.170931
196,0,196,-1.090890,1.260034,0.169144
197,0,197,-1.188372,0.808917,-0.379455
198,0,198,-0.513046,-1.106606,-1.619652


Alternatively, the user can supply their custom effect sizes in a compatible DataFrame and manually run the consecutive steps of the simulation instead of calling the sim_phenotypes_custom function. For the univariate case, the DataFrame must be built as shown for `input_df_manual` above.
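
To make the expected layout concrete, the following sketch coerces a list, a dictionary, or a DataFrame into a frame with the columns mutation_id, effect_size, and causal_mutation_id, mirroring how `input_df_manual` is built. The helper `to_effect_frame` is hypothetical and is not part of the library; it also assumes that list entries are indexed by mutation id starting at 0.

```python
import pandas as pd

def to_effect_frame(effects):
    """Hypothetical helper: build a univariate effect-size frame from a
    list, a dict (mutation id -> effect size), or a DataFrame."""
    if isinstance(effects, pd.DataFrame):
        frame = effects[["mutation_id", "effect_size"]].copy()
    elif isinstance(effects, dict):
        frame = pd.DataFrame(
            list(effects.items()), columns=["mutation_id", "effect_size"]
        )
    else:
        # Assumption: position in the list is the mutation id.
        frame = pd.DataFrame(
            {"mutation_id": range(len(effects)), "effect_size": list(effects)}
        )
    frame["causal_mutation_id"] = 0  # univariate case, as in input_df_manual
    return frame

effects_list = [0.5, -1.0, 2.0]
effects_dict = {0: 0.5, 1: -1.0, 2: 2.0}
effects_df = pd.DataFrame(
    list(effects_dict.items()), columns=["mutation_id", "effect_size"]
)

frames = [to_effect_frame(x) for x in (effects_list, effects_dict, effects_df)]
print(frames[0])
```

All three inputs yield the same frame here, which is the property the list, dictionary, and DataFrame demos rely on.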

Next, we show how to simulate phenotypes with user-defined noise: instead of a heritability, the mean and covariance of the environmental noise are passed directly.

In [8]:
normalize_genetic_values_before_noise = True

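mean_1 = 0 #mean of the environmental noise
std_1 = 1 #passed as user_cov below; at a value of 1 the standard deviation and variance coincide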

phenotypes_df_user_noise = sim_phenotypes_custom(grg_1, input_df, normalize_genetic_values_before_noise=normalize_genetic_values_before_noise, user_mean=mean_1, user_cov=std_1)
phenotypes_df_user_noise

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0     0.174430                   0
1                1    -0.493785                   0
2                2    -0.108830                   0
3                3     0.271743                   0
4                4     0.081619                   0
...            ...          ...                 ...
10888        10888    -0.507485                   0
10889        10889    -1.161496                   0
10890        10890    -0.596188                   0
10891        10891    -0.386963                   0
10892        10892     0.970477                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0     -38.280037                   0
1                1     -42.299913                   0
2                2      24.824838                   0
3                3       3.064312                   0
4      

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-0.273825,-0.373782,-0.647607
1,0,1,-0.378820,0.623412,0.244592
2,0,2,1.374414,-0.106782,1.267632
3,0,3,0.806050,-0.853337,-0.047287
4,0,4,-0.782916,-0.729869,-1.512784
...,...,...,...,...,...
195,0,195,1.002198,0.218603,1.220801
196,0,196,-1.090890,0.440415,-0.650475
197,0,197,-1.188372,1.483927,0.295555
198,0,198,-0.513046,-0.749486,-1.262532
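
As a rough illustration of what user-defined noise amounts to, the sketch below draws Gaussian noise with a chosen mean and standard deviation and adds it to already normalized genetic values. It is written in plain NumPy under the assumption of independent Gaussian noise and is not the library's code; the variable names and the short array of genetic values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical normalized genetic values for a handful of individuals.
genetic_values = np.array([-0.27, -0.38, 1.37, 0.81, -0.78])

# User-chosen noise parameters, matching mean_1 and std_1 in the cell above.
user_mean, user_std = 0.0, 1.0

# Draw one noise term per individual and form the phenotype additively.
noise = rng.normal(loc=user_mean, scale=user_std, size=genetic_values.shape)
phenotypes = genetic_values + noise

# If the genetic values have unit variance and the noise is independent of
# them, the heritability implied by the noise variance is 1 / (1 + std^2).
implied_h2 = 1.0 / (1.0 + user_std ** 2)

print(phenotypes)
print(implied_h2)
```

With unit-variance genetic values and unit-variance noise, the implied heritability is 0.5, so choosing the noise parameters directly is another way of fixing the genetic share of the phenotypic variance.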
