-
Notifications
You must be signed in to change notification settings - Fork 3
/
mr_dgp.rmd
573 lines (458 loc) · 20.4 KB
/
mr_dgp.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
---
title: Data generating process for genetic effects on complex traits with interactions
author: Gibran Hemani
date: '`r format(Sys.Date())`'
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Data generating process for genetic effects on complex traits with interactions}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
Background: [Deep IV](http://proceedings.mlr.press/v70/hartford17a.html) can infer causal effects in which the sample is stratified by one or many covariates, and there is a different causal effect within each stratum.
The covariates don't necessarily need to be directly measured, could have a large number of proxy variables for example.
# Genetic effects
To be realistic, obtain known associations for BMI:
```{r}
library(ieugwasr)
library(dplyr)
library(ggdag)
library(dagitty)
bmi_hits <- ieugwasr::tophits("ukb-b-19953")
str(bmi_hits)
```
There are `r nrow(bmi_hits)` independent instruments for BMI. All the instruments have small effects:
```{r}
bmi_hits$rsq <- 2 * bmi_hits$beta^2 * bmi_hits$eaf * (1 - bmi_hits$eaf)
hist(bmi_hits$rsq, breaks=100)
```
In total the `r nrow(bmi_hits)` BMI hits explain `r sum(bmi_hits$rsq) * 100`% of the variance in BMI.
What is the effect on coronary heart disease?
```{r}
chd_assoc <- ieugwasr::associations(variants=bmi_hits$rsid, id="ieu-a-7")
dat <- merge(bmi_hits, chd_assoc, by="rsid")
str(dat)
summary(lm(beta.y ~ 0 + beta.x, dat, weight=1/dat$se.y^2))
```
This is a pretty big effect. So use that as a ceiling. Need a model to relate variance explained by all the variants, the effect sizes, and the minor allele frequencies (MAFs) of the variants
The variance explained by a variant $G_j$ is related to
$$
V_{G_j} = 2 \beta_j p_j (1-p_j)
$$
where $p_j$ is the minor allele frequency of the variant, and $\beta_j$ is its effect size. What is the distribution of allele frequencies of instruments for BMI?
```{r}
bmi_hits$maf <- bmi_hits$eaf
bmi_hits$maf[bmi_hits$maf > 0.5] <- 1 - bmi_hits$maf[bmi_hits$maf > 0.5]
hist(bmi_hits$maf)
```
And the relationship between MAF and effect size
```{r}
plot(log(bmi_hits$maf), abs(bmi_hits$beta))
```
Deciding how the the effect sizes relate to allele frequency depends on model of natural selection, parameter $S$, which can inform the sampling of the SNP effect sizes like this:
$$
\beta_j \sim N(0, [2p_j (1-p_j)]^S \sigma^2_\beta)
$$
Here $\sigma^2_\beta = V_G / M$ where $M$ is the number of causal variants for the trait and $V_G$ is the variance explained by all those SNPs, which relates to the heritability equation of $h^2 = V_G / V_P$, where $V_P$ is the phenotypic variance.
Problem with this model is that if we assume that all the instruments explain the entire variance of the trait then a lot of the $\beta_j$ values will be too small to be realistic instruments. So the strategy would be to
1. Choose a number of causal variants for the trait (large, e.g. >10000)
2. Sample the allele frequency distribution of all causal variants
3. Sample the effects for all causal variants
4. Retain only those that cumulatively explain some amount of variance. e.g. For BMI, the variance exlained is `r round(sum(bmi_hits$rsq, 2))`, so we just want the set of variants that are required to explain that much variance.
Ultimately, we can draw allele frequencies from some distribution e.g.
$$
p_j \sim Beta(\alpha=2,\beta=2)
$$
This would approx match the frequencies of the instruments for BMI:
```{r}
hist(rbeta(500, 2, 2))
```
But it's probably not representative of the full set of causal variants, because we have more power to detect common variants. Perhaps just use a uniform distribution. Bringing it all together:
```{r}
#' Generate realistic variant effects from GWAS
#'
#' @param af Array of allele frequencies for every variant that a variant effect needs to be generated
#' @param h2 Heritability of the trait (e.g. BMI =~ 0.5)
#' @param S Selection coefficient. 0 = neutral (no selection), +ve = positive selection, -ve = negative selection
#' @param h2_inst Variance explained by the detected variants. e.g. for BMI =~ 0.05
#'
#' @return Data frame
generate_beta_gx <- function(af, h2, S=0, h2_inst)
{
nvariant <- length(af)
if(h2 == 0)
{
return(dplyr::tibble(beta=0, af=af))
}
# Sample variant effects based on S and allele frequency
beta <- rnorm(nvariant, mean=0, sd = sqrt((af * 2 * (1-af))^S))
# Scale effects to be in standard deviation units and to explain the total h2
vg <- sum(af * 2 * (1-af) * beta^2)
ve <- (vg - h2 * vg) / h2
vy <- vg + ve
beta <- beta / sqrt(vy)
# Select variants to be detected as instruments
dat <- dplyr::tibble(beta=beta, af=af)
dat$rsq <- dat$beta^2 * 2 * dat$af * (1 - dat$af)
dat <- dplyr::arrange(dat, desc(rsq))
dat$rsq_cumsum <- cumsum(dat$rsq)
dat$instrument <- dat$rsq_cumsum <= h2_inst
# If the first instrument explains all of instrument variance then just use that
dat$instrument[1] <- TRUE
return(dat)
}
```
Example:
```{r}
bgx <- generate_beta_gx(af = rbeta(40000, 1, 1), h2 = 0.5, S = 0, h2_inst = 0.05)
sum(bgx$instrument)
```
These parameters give something similar to the number of detected hits for BMI. Check that the distribution of rsq values are similar. BMI (black) and simulated (red):
```{r}
bmi_hits %>% dplyr::arrange(desc(rsq)) %>% {plot(.$rsq)}
points(subset(bgx, instrument)$rsq, col="red")
```
The difference here is that a more appropriate model would be that different sets of variants are drawn from different distributions. For example, 1% of the variance could be explained by 10 variants, and the other 4% by 400. variants. The model gets a bit verbose doing it like this, potentially requring a different S parameter for different sets of variants, which doesn't really make sense.
```{r}
generate_beta_gx2 <- function(af, h2, S=0, h2_inst)
{
stopifnot(is.list(af))
stopifnot(length(h2) == length(af))
dat <- list()
for(i in 1:length(af))
{
dat[[i]] <- generate_beta_gx(af=af[[i]], h2=h2[i], S=S, h2_inst = 1)
}
dat <- bind_rows(dat)
dat$rsq <- dat$beta^2 * 2 * dat$af * (1 - dat$af)
dat <- dplyr::arrange(dat, desc(rsq))
dat$rsq_cumsum <- cumsum(dat$rsq)
dat$instrument <- dat$rsq_cumsum <= h2_inst
# If the first instrument explains all of instrument variance then just use that
dat$instrument[1] <- TRUE
return(dat)
}
```
Try again:
```{r}
bgx <- generate_beta_gx2(af=list(runif(10), runif(80), runif(200), runif(90000)), h2=c(0.01, 0.01, 0.01, 0.48), S=0, h2_inst=0.05)
sum(bgx$instrument)
bmi_hits %>% dplyr::arrange(desc(rsq)) %>% {plot(.$rsq)}
points(subset(bgx, instrument)$rsq, col="red")
```
These parameters give something similar to the number and distribution of detected hits for BMI. There are other things that we haven't accounted for here such as winner's curse. Check that the distribution of rsq values are similar.
# Model 1 - simple linear relationship between x and y with large numbers of instruments of small effect
$$
\begin{aligned}
x_i &= a^x + \beta_{xG}G_i + \beta_{xC}C_i + \epsilon^x_i \\
y_i &= a^y + \beta_{yx} x_i + \beta_{yC}C_i + \beta_{yG}G_i + \epsilon^y_i
\end{aligned}
$$
- $x_i$ = exposure value for $i$th individual
- $a^x, a^y$ = intercept terms
- $y_i$ = outcome value for $i$th individual
- $G$ = matrix of genotypes for each sample, only including genotypes to be used as instruments (not all causal variants)
- $\beta_xG$ = vector of effects of each variant in genotype matrix $G$ on $x$
- $\beta_{xC},\beta_{yC}$ = effects of confounders on $x$ and $y$
- $\beta_{yG}$ = vector of effects of each variant on $y$, independent of $x$ (i.e. horizontal pleiotropic effects)
Expect quite substantial confounding and relatively small causal effects. Easiest to talk about in terms of variance explained.
```{r}
#' Title
#'
#' @param sample_size
#' @param af Vector of allele frequencies (should be between 0 and 1)
#'
#' @return Matrix
generate_G <- function(sample_size, af)
{
m <- matrix(0, sample_size, length(af))
for(i in 1:length(af))
{
m[,i] <- rbinom(sample_size, 2, af[i])
}
return(m)
}
#' Scale genetic effects to satisfy heritability parameter
#'
#' Supposing that we want all our sampled effects to explain a particular amount of variance in the trait, we can transform those effects based on the allele frequencies of the variants
#'
#' @param beta vector of effects
#' @param af vector of allele frequencies
#' @param h2 variance explained by the effects (assuming trait variance is 1)
#'
#' @return vector of scaled effects
scale_bg <- function(beta, af, h2)
{
vg <- sum(af * 2 * (1-af) * beta^2)
ve <- (vg - h2 * vg) / h2
vy <- vg + ve
beta <- (beta - mean(beta)) / sqrt(vy)
return(beta)
}
#' Effect of x on y constant, with allowance for balanced pleiotropy
#'
#' @param sample_size Sample size to generate
#' @param h2 Heritability of x
#' @param nvariants Number of causal variants of x
#' @param h2_inst Variance explained by detected instruments
#' @param S Selection coefficient
#' @param maf_alpha Allele frequency distribution shape parameter, to be fed into beta distribution.
#' @param maf_beta Allele frequency distribution shape parameter, to be fed into beta distribution.
#' @param rsq_yx Causal variance explained in y by x
#' @param rsq_xu Variance explained in x by unmeasured confounder
#' @param rsq_yu Variance explained in y by unmeasured confounder
#' @param rsq_yg Horizontal pleiotropy parameter, variance explained by all variants
#'
#' @return List
#' \itemize{
#' \item x - Vector
#' \item y - Vector
#' \item u - Vector
#' \item G - Matrix
#' }
dgp1 <- function(sample_size, h2, nvariants, h2_inst, S, maf_alpha, maf_beta, rsq_yx, rsq_xu, rsq_yu, rsq_yg)
{
stopifnot(rsq_xu + h2_inst <= 1)
stopifnot(rsq_yu + rsq_yx + rsq_yg <= 1)
# Generate allele frequencies for all causal variants
af <- rbeta(nvariants, maf_alpha, maf_beta)
# generate confounder
u <- rnorm(sample_size)
# Generate random errors for x and y
epsilon_x <- rnorm(sample_size, mean=0, sd=sqrt(1 - rsq_xu - h2_inst))
epsilon_y <- rnorm(sample_size, mean=0, sd=sqrt(1 - rsq_yu - rsq_yx - rsq_yg))
# Generate instrument effects on x
bgx <- generate_beta_gx(af=af, h2=h2, S=S, h2_inst=h2_inst) %>%
dplyr::filter(instrument)
# Generate genotype matrix
G <- generate_G(sample_size, bgx$af)
# Generate horizontal pleiotropy effects
bgy <- scale_bg(rnorm(nrow(bgx)), bgx$af, rsq_yg)
# generate x and y
x <- G %*% bgx$beta + u * sqrt(rsq_xu) + epsilon_x
y <- G %*% bgy + u * sqrt(rsq_yu) + epsilon_y + x * sqrt(rsq_yx)
return(list(x=x, y=y, u=u, G=G))
}
```
```{r}
dat <- dgp1(40000, 0.5, 40000, 0.04, 0, 1, 1, 0.01, 0.1, 0.1, 0.05)
bgxhat <- simulateGP::gwas(dat$x, dat$G)
bgyhat <- simulateGP::gwas(dat$y, dat$G)
plot(bgyhat$bhat ~ bgxhat$bhat)
summary(lm(bgyhat$bhat ~ 0 + bgxhat$bhat, weight=1/bgyhat$se^2))
```
```{r}
parameters <- expand.grid(
h2 = 0.5,
nvariants = 40000,
h2_inst = 0.05,
S = 0,
maf_alpha = 1,
maf_beta = 1,
sample_size = 400000,
rsq_yx = c(0, 0.001, 0.01, 0.05),
rsq_xu = 0.1,
rsq_yu = 0.1,
rsq_yg = c(0, 0.01, 0.05, 0.1)
)
```
# Model 2 - Heterogeneous treatment effect of x on y based on covariate
If there is a covariate $c$ that can have an influence on both $x$ and $y$, but the causal effect of $x$ on $y$ depends on the value of $c$. Suppose that $c$ is split across $k$ strata, and the causal effect of $x$ on $y$ is given as $b_{xy}^k$.
```{r}
#' Effect of x on y varies based on covariate
#'
#' @param sample_size Sample size to generate
#' @param h2 Heritability of x
#' @param nvariants Number of causal variants of x
#' @param h2_inst Variance explained by detected instruments
#' @param S Selection coefficient
#' @param maf_alpha Allele frequency distribution shape parameter, to be fed into beta distribution.
#' @param maf_beta Allele frequency distribution shape parameter, to be fed into beta distribution.
#' @param r_yx Vector of causal effects of x on y. Samples will be split into a number of groups based on this vector. The split is based on the covariate value c. Assumes variance of x and y both 1
#' @param r_xu Effect of unmeasured confounder on x (assuming variance of x and u both 1)
#' @param r_yu Effect of unmeasured confounder on y (assuming variance of y and u both 1)
#' @param r_xc Effect of interacting covariate on x (assuming variance of x and c both 1)
#' @param r_yc Effect of interacting covariate on y (assuming variance of y and c both 1)
#' @param rsq_yg Horizontal pleiotropy parameter, variance explained by all variants
#'
#' @return List
#' \itemize{
#' \item x - Vector
#' \item y - Vector
#' \item c - Vector
#' \item u - Vector
#' \item G - Matrix
#' }
dgp2 <- function(sample_size, h2, nvariants, h2_inst, S, maf_alpha, maf_beta, r_yx, r_xu, r_yu, r_xc, r_yc, rsq_yg)
{
# The overall variance explained in y by x
# This is probably wrong!
rsq_yx <- mean(r_yx^2)
# Checks
stopifnot(r_xu^2 + r_xc^2 + h2_inst <= 1)
stopifnot(r_yu^2 + r_yc^2 + rsq_yx + rsq_yg <= 1)
# Generate allele frequencies
af <- rbeta(nvariants, maf_alpha, maf_beta)
# Create unmeasured confounder
u <- rnorm(sample_size)
# Create interaction variable c
c <- rnorm(sample_size)
# Error term in x
epsilon_x <- rnorm(sample_size, mean=0, sd=sqrt(1 - r_xu^2 - r_xc^2 - h2_inst))
# Error term in y
epsilon_y <- rnorm(sample_size, mean=0, sd=sqrt(1 - r_yu^2 - r_yc^2 - rsq_yx - rsq_yg))
# Generate instrument effects on x
bgx <- generate_beta_gx(af=af, h2=h2, S=S, h2_inst=h2_inst) %>%
dplyr::filter(instrument)
# Generate genotype matrix
G <- generate_G(sample_size, bgx$af)
# Pleiotropic effects
bgy <- scale_bg(rnorm(nrow(bgx)), bgx$af, rsq_yg)
# Simulate x and y (excluding x->y effect)
x <- G %*% bgx$beta + u * r_xu + c * r_xc + epsilon_x
y <- G %*% bgy + u * r_yu + c * r_yc + epsilon_y
# Split sample into strata
nstrata <- length(r_yx)
quantiles <- quantile(c, seq(0, 1, length.out=nstrata+1))
quantiles[1] <- quantiles[1] - 1
# Add causal effect for each stratum
for(i in 1:nstrata)
{
index <- c > quantiles[i] & c <= quantiles[i+1]
y[index] <- y[index] + x[index] * r_yx[i]
}
return(list(x=x, y=y, u=u, c=c, G=G))
}
```
Example:
```{r}
dat <- dgp2(sample_size=10000, h2=0.5, nvariants=40000, h2_inst=0.05, S=0, maf_alpha=1, maf_beta=1, r_yx=c(-0.2, 0.2), r_xu=0.1, r_yu=0.1, r_xc=0, r_yc=0, rsq_yg=0)
bgxhat <- simulateGP::gwas(dat$x, dat$G)
bgyhat <- simulateGP::gwas(dat$y, dat$G)
plot(bgyhat$bhat ~ bgxhat$bhat)
summary(lm(bgyhat$bhat ~ 0 + bgxhat$bhat, weight=1/bgyhat$se^2))
```
```{r}
index <- dat$c < mean(dat$c)
bgxhat1 <- simulateGP::gwas(dat$x[index], dat$G[index,])
bgyhat1 <- simulateGP::gwas(dat$y[index], dat$G[index,])
bgxhat2 <- simulateGP::gwas(dat$x[!index], dat$G[!index,])
bgyhat2 <- simulateGP::gwas(dat$y[!index], dat$G[!index,])
plot(bgyhat1$bhat ~ bgxhat1$bhat)
summary(lm(bgyhat1$bhat ~ 0 + bgxhat1$bhat, weight=1/bgyhat1$se^2))
plot(bgyhat2$bhat ~ bgxhat2$bhat)
summary(lm(bgyhat2$bhat ~ 0 + bgxhat2$bhat, weight=1/bgyhat2$se^2))
```
# Model 3 - The interacting variable from (2) is proxied by a large number of proxies
Generate proxies to $C$ using Cholesky decomposition of a randomly generated correlation matrix
```{r}
generate_cors <- function(C, nproxy, cors)
{
cormat <- matrix(cors, nproxy+1, nproxy+1)
diag(cormat) <- 1
stopifnot(all(eigen(cormat)$values > 0))
x <- matrix(rnorm(length(C) * (nproxy+1)), length(C), nproxy+1)
x[,1] <- C
y <- x %*% solve(chol(var(x))) %*% chol(cormat)
return(y[,-1])
}
```
Perform simulation using `dgp2()` as in Model 2, and now create proxies traits for `dat$c` using the `generate_cors()` function. This is pretty simple, but could generate likely correlation structures from real data.
# Model 4 - Heterogeneous treatment effect at the instrument level
In models 2 and 3 the treatment effect of $x$ on $y$ varies based on individual's status at $c$. Including $C$ in DeepIV helps obtain treatment effects at different levels of $C$. In this model we look at heterogeneous treatment effects at the instrument level - that is where each instrument influences the outcome through a different mechanism. This is somewhat similar to the model assumed by the modal based estimator, in that instruments will cluster based on similarity of effect estimate. However, here we include variables that specify the clusters. Ideally, a model would identify that BMI does not have a causal effect but the proxy variables do.
Suppose that the set of genetic influences on BMI arise because BMI itself is a composite of several other traits. BMI itself doesn't have a causal effect on coronary heart disease, but it captures some traits that do have a causal effect.
For example,
- BMI = Body mass index
- CHD = Coronary heart disease
- Adi = Adiposity
- Mus = Muscle mass
- BMD = Bone mineral density
We are estimating the effect of BMI on CHD, but as can be seen, BMI has no influence. It is influenced by adiposity and muscle mass that each have effects on ChD. It is also influenced by bone mineral density that has no effect on CHD. So all the instruments for BMI are actually instruments for those other traits.
```{r}
d <- dagify(Mus ~ G1,
Adi ~ G2,
BMD ~ G3,
BMI ~ Mus + Adi + BMD,
CHD ~ Mus + Adi,
exposure="BMI", outcome="CHD")
ggdag_classic(d) + theme_dag_blank()
```
If we begin with effect sizes for BMI we can back calculate the genetic effects on each of the influencing factors $P$, given causal effects of those values on BMI.
$$
\beta_{pg} = \beta_{xg} / \beta_{xp}
$$
We can construct BMI and CHD from each of the influencing factors $P_k$.
$$
\begin{aligned}
P_k &= G_k \beta^k_{pg} + u \beta_{P_Ku} + e^{P_k} \\
x &= P\beta_{xp} + u \beta_{xu} + e^x \\
y &= P\beta_{yp} + u \beta_{yu} + e^y
\end{aligned}
$$
```{r}
#'
#'
#' @param sample_size
#' @param h2 heritability of x
#' @param nvariants number of variants influencing x overall
#' @param h2_inst variance explained by detected instruments for x
#' @param S selection coefficient
#' @param maf_alpha maf parameter
#' @param maf_beta maf parameter
#' @param r_yx causal effect of x on y. Intention is for this to be 0 in this model
#' @param r_xu confounding effect on x
#' @param r_yu confounding effect on y
#' @param r_xp vector of effects for each P on x. e.g. effects of BMD, adiposity and muscle mass on BMI
#' @param r_yp vector of effects for each P on y. e.g. effects of BMD, adiposity and muscle mass on CHD
#' @param r_pu vector of effects of unmeasured confounder on p
#' @param rsq_yg horizontal pleiotropy. variance explained by genetic variants on y directly
#'
#' @return list of
dgp4 <- function(sample_size, h2, nvariants, h2_inst, S, maf_alpha, maf_beta, r_yx, r_xu, r_yu, r_xp, r_yp, r_pu, rsq_yg)
{
# Checks
stopifnot(r_xu^2 + sum(r_xp^2) + h2_inst <= 1)
stopifnot(r_yu^2 + sum(r_yp^2) + r_yx^2 + rsq_yg <= 1)
# Generate allele frequencies
af <- rbeta(nvariants, maf_alpha, maf_beta)
# Create unmeasured confounder
u <- rnorm(sample_size)
# Error term in x
epsilon_x <- rnorm(sample_size, mean=0, sd=sqrt(1 - r_xu^2 - sum(r_xp^2) - h2_inst))
# Error term in y
epsilon_y <- rnorm(sample_size, mean=0, sd=sqrt(1 - r_yu^2 - sum(r_yp^2) - r_yx^2 - rsq_yg))
# Generate instrument effects on x
bgx <- generate_beta_gx(af=af, h2=h2, S=S, h2_inst=h2_inst) %>%
dplyr::filter(instrument)
# Generate genotype matrix
G <- generate_G(sample_size, bgx$af)
# Pleiotropic effects
bgy <- scale_bg(rnorm(nrow(bgx)), bgx$af, rsq_yg)
# Generate P variables
np <- length(r_yp)
stopifnot(length(r_yp) == length(r_xp))
stopifnot(length(r_pu) == length(r_xp))
P <- matrix(0, sample_size, np)
# Choose variants in bgx for each P
# Split the variants based on trait's effect on x
props <- r_xp^2 / sum(r_xp^2) * nrow(bgx)
index <- sample(1:np, size=nrow(bgx), replace=TRUE, prob=props)
print(table(index))
for(i in 1:np)
{
gbv <- G[,index==i] %*% (bgx$beta[index==i] / r_xp[i])
P[,i] <- gbv + u * r_pu[i] + rnorm(sample_size, mean=0, sd=sqrt(1 - r_pu[i]^2 - var(gbv)))
}
# Simulate x and y
x <- P %*% r_xp + u * r_xu + epsilon_x
y <- G %*% bgy + u * r_yu + P %*% r_yp + x * r_yx + epsilon_y
return(list(x=x, y=y, u=u, P=P, G=G))
}
```
Example:
```{r}
dat <- dgp4(sample_size=10000, h2=0.5, nvariants=40000, h2_inst=0.05, S=0, maf_alpha=1, maf_beta=1, r_yx=0, r_xu=sqrt(0.1), r_yu=sqrt(0.1), r_xp=sqrt(c(0.2, 0.5, 0.1)), r_yp=c(0.5, -0.2, 0), r_pu=sqrt(c(0.1,0.1,0.1)), rsq_yg=0)
bgxhat <- simulateGP::gwas(dat$x, dat$G)
bgyhat <- simulateGP::gwas(dat$y, dat$G)
plot(bgyhat$bhat ~ bgxhat$bhat)
summary(lm(bgyhat$bhat ~ 0 + bgxhat$bhat, weight=1/bgyhat$se^2))
```