# PD7 - Directed Acyclic Graphs

In [1]:
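# Logical mask: keep each of the 1,000,000 population rows with probability 0.001
sample <- runif(1000000, 0, 1) < 0.001
sum(sample)  # size of the subsample used in every regression below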

## 1.

![](assets/dags1.png)

In [2]:
pop_Z <- rnorm(1000000)
pop_X <- pop_Z + rnorm(1000000)
pop_Y <- pop_Z + pop_X + rnorm(1000000)

In [3]:
data <- data.frame(x = pop_X, y = pop_Y, z = pop_Z)
summary(lm(y ~ 0 + x, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7165 -0.7473 -0.0483  0.7997  3.4678 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  1.48590    0.02556   58.13   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.163 on 1024 degrees of freedom
Multiple R-squared:  0.7675,	Adjusted R-squared:  0.7672 
F-statistic:  3380 on 1 and 1024 DF,  p-value: < 2.2e-16


In [4]:
summary(lm(y ~ 0 + x + z, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x + z, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0781 -0.6179 -0.0120  0.6568  3.2849 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  1.02917    0.03074   33.48   <2e-16 ***
z  0.92548    0.04461   20.75   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9762 on 1023 degrees of freedom
Multiple R-squared:  0.8363,	Adjusted R-squared:  0.836 
F-statistic:  2614 on 2 and 1023 DF,  p-value: < 2.2e-16


1. Good controls: controlling for a confounder
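As a rough check on the first simulation (In [2] to In [4]), the value that the regression without Z converges to can be worked out by hand. This assumes, as the code suggests, that all noise terms are independent standard normals, so that Var(X) = 2 and Cov(X, Z) = 1:

```latex
\hat\beta_x \;\longrightarrow\;
\frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
= \frac{\operatorname{Var}(X) + \operatorname{Cov}(X, Z)}{\operatorname{Var}(X)}
= \frac{2 + 1}{2} = 1.5
```

The estimate of 1.486 from `y ~ 0 + x` sits close to this biased value, while adding Z brings the coefficient back near the true effect of 1.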

In [5]:
pop_U <- rnorm(1000000)
pop_Z <- pop_U + rnorm(1000000)
pop_X <- pop_Z + rnorm(1000000)
pop_Y <- pop_U + pop_X + rnorm(1000000)

In [6]:
data <- data.frame(x = pop_X, y = pop_Y, z = pop_Z)
summary(lm(y ~ 0 + x, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1249 -0.8547 -0.0144  0.7427  4.6348 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  1.37391    0.02253   60.99   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.251 on 1024 degrees of freedom
Multiple R-squared:  0.7842,	Adjusted R-squared:  0.7839 
F-statistic:  3720 on 1 and 1024 DF,  p-value: < 2.2e-16


In [7]:
summary(lm(y ~ 0 + x + z, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x + z, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0177 -0.8140 -0.0537  0.6740  3.7833 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  1.08480    0.03762  28.840   <2e-16 ***
z  0.43416    0.04622   9.393   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.201 on 1023 degrees of freedom
Multiple R-squared:  0.8013,	Adjusted R-squared:  0.8009 
F-statistic:  2063 on 2 and 1023 DF,  p-value: < 2.2e-16


## 2. 

![](assets/dags2.png)

In [8]:
pop_U <- rnorm(1000000)
pop_Z <- pop_U + rnorm(1000000)
pop_X <- pop_U + rnorm(1000000)
pop_M <- pop_Z + rnorm(1000000)
pop_Y <- pop_X + pop_M + rnorm(1000000)

In [9]:
data <- data.frame(x = pop_X, y = pop_Y, z = pop_Z)
summary(lm(y ~ 0 + x, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.9477 -1.4200 -0.0597  1.2456  6.2983 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  1.52538    0.04125   36.98   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.909 on 1024 degrees of freedom
Multiple R-squared:  0.5718,	Adjusted R-squared:  0.5714 
F-statistic:  1368 on 1 and 1024 DF,  p-value: < 2.2e-16


In [10]:
summary(lm(y ~ 0 + x + z, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x + z, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8808 -1.1282 -0.0919  0.9178  4.9571 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  0.97392    0.03784   25.74   <2e-16 ***
z  1.03274    0.03872   26.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.466 on 1023 degrees of freedom
Multiple R-squared:  0.7475,	Adjusted R-squared:  0.747 
F-statistic:  1514 on 2 and 1023 DF,  p-value: < 2.2e-16


2. Good control: controlling for Z blocks the back-door path X ← U → Z → M → Y
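A minimal sketch of the same comparison on a much larger sample, so that sampling noise is mostly out of the picture. It re-simulates the structure from In [8] with lower-case names; the seed and the sample size are arbitrary choices made here, not part of the original analysis.

```r
set.seed(42)  # arbitrary seed, only so this sketch is reproducible
n <- 200000   # arbitrary large sample size

# Same structure as In [8]: U -> Z, U -> X, Z -> M, X -> Y, M -> Y
u <- rnorm(n)
z <- u + rnorm(n)
x <- u + rnorm(n)
m <- z + rnorm(n)
y <- x + m + rnorm(n)

b_short <- coef(lm(y ~ 0 + x))[["x"]]      # back-door path left open
b_long  <- coef(lm(y ~ 0 + x + z))[["x"]]  # back-door path blocked by z

round(c(without_z = b_short, with_z = b_long), 3)
```

Under these assumptions the first coefficient should land near 1.5 and the second near the true effect of 1, in line with the subsample estimates above.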

## 3.

![](assets/dags3.png)

In [11]:
pop_U1 <- rnorm(1000000)
pop_U2 <- rnorm(1000000)
pop_Z <- pop_U1 + pop_U2 + rnorm(1000000)
pop_X <- pop_U1 + rnorm(1000000)
pop_Y <- pop_X + pop_U2 + rnorm(1000000)

In [12]:
data <- data.frame(x = pop_X, y = pop_Y, z = pop_Z)
summary(lm(y ~ 0 + x, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.7416 -0.8148  0.1037  1.0418  4.5118 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  0.97485    0.03105   31.39   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.411 on 1024 degrees of freedom
Multiple R-squared:  0.4904,	Adjusted R-squared:  0.4899 
F-statistic: 985.4 on 1 and 1024 DF,  p-value: < 2.2e-16


In [13]:
summary(lm(y ~ 0 + x + z, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x + z, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3085 -0.7855  0.0809  0.8783  4.3582 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  0.77177    0.03064   25.19   <2e-16 ***
z  0.40384    0.02546   15.86   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.265 on 1023 degrees of freedom
Multiple R-squared:  0.591,	Adjusted R-squared:  0.5902 
F-statistic:   739 on 2 and 1023 DF,  p-value: < 2.2e-16


3. Bad control: controlling for a collider (M-graph)
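The size of the collider bias can also be obtained without simulation, by solving the population normal equations. The sketch below assumes that every noise term in In [11] is an independent standard normal, which gives the variances and covariances hard-coded in it; the object names are chosen here for illustration.

```r
# Population moments implied by In [11]:
#   X = U1 + e_X,  Z = U1 + U2 + e_Z,  Y = X + U2 + e_Y
S_xx <- matrix(c(2, 1,
                 1, 3),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("x", "z"), c("x", "z")))  # Var/Cov of (X, Z)
s_xy <- c(x = 2, z = 2)                                    # Cov(X, Y), Cov(Z, Y)

# Limit of the coefficient in y ~ 0 + x
b_short <- s_xy[["x"]] / S_xx["x", "x"]

# Limits of the coefficients in y ~ 0 + x + z
b_long <- setNames(as.vector(solve(S_xx, s_xy)), c("x", "z"))

b_short
b_long
```

With these moments the regression without Z targets the true effect of 1, while adding Z pulls the X coefficient down to 0.8 (and gives Z a coefficient of 0.4), which is roughly what the subsample regressions show.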

## 4. Damned if you do, damned if you don't

![](assets/dags4.png)

In this situation, it is not clear whether Z is a good or a bad control. Other tools, such as sensitivity analysis, may be needed to assess the effect of using it as a control.
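The figure for this case is not reproduced in the text, so the sketch below assumes, purely for illustration, a DAG in which Z is both a collider (U1 → Z ← U2, as in the M-graph of section 3) and a direct cause of Y. The structure, the unit coefficients, the seed and the sample size are all assumptions made here, not taken from the original.

```r
set.seed(123)  # arbitrary seed
n <- 200000    # arbitrary large sample size

# Hypothetical DAG: U1 -> X, U1 -> Z, U2 -> Z, U2 -> Y, Z -> Y, X -> Y
u1 <- rnorm(n)
u2 <- rnorm(n)
z  <- u1 + u2 + rnorm(n)
x  <- u1 + rnorm(n)
y  <- x + z + u2 + rnorm(n)   # true effect of x on y is 1

# Without z: the path X <- U1 -> Z -> Y stays open
b_without <- coef(lm(y ~ 0 + x))[["x"]]

# With z: conditioning on the collider opens X <- U1 -> Z <- U2 -> Y
b_with <- coef(lm(y ~ 0 + x + z))[["x"]]

round(c(without_z = b_without, with_z = b_with), 3)
```

In this assumed setup neither regression recovers the true effect of 1: leaving Z out biases the estimate upward (towards 1.5) and including it biases the estimate downward (towards 0.8).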

## 5.

![](assets/dags5.png)

In [14]:
pop_Z <- rnorm(1000000)
pop_X <- rnorm(1000000)
pop_Y <- pop_X + pop_Z + rnorm(1000000)

In [15]:
data <- data.frame(x = pop_X, y = pop_Y, z = pop_Z)
summary(lm(y ~ 0 + x, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3340 -0.9861  0.0061  0.9887  5.0667 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  0.99795    0.04432   22.52   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.449 on 1024 degrees of freedom
Multiple R-squared:  0.3312,	Adjusted R-squared:  0.3305 
F-statistic: 507.1 on 1 and 1024 DF,  p-value: < 2.2e-16


In [16]:
summary(lm(y ~ 0 + x + z, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x + z, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8748 -0.7135 -0.0204  0.7212  3.0431 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  1.03645    0.03176   32.63   <2e-16 ***
z  0.98091    0.03144   31.20   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.038 on 1023 degrees of freedom
Multiple R-squared:  0.6573,	Adjusted R-squared:  0.6567 
F-statistic: 981.2 on 2 and 1023 DF,  p-value: < 2.2e-16


5. Neutral control (but raises precision)
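The gain in precision can be approximated by hand. Assuming independent standard normal noise as in In [14], and n = 1025 (inferred from the 1024 residual degrees of freedom of the one-coefficient model), the standard error of the X coefficient is roughly the residual standard deviation divided by the square root of n Var(X):

```latex
\mathrm{SE}(\hat\beta_x) \approx \frac{\sigma_{\mathrm{res}}}{\sqrt{n \operatorname{Var}(X)}},
\qquad
\text{without } Z:\ \frac{\sqrt{2}}{\sqrt{1025}} \approx 0.044,
\qquad
\text{with } Z:\ \frac{1}{\sqrt{1025}} \approx 0.031
```

Without Z, its variance is absorbed into the residual (variance 2 instead of 1), which matches the reported standard errors of 0.0443 and 0.0318 reasonably well.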

## 6.

![](assets/dags6.png)

In [17]:
pop_Z <- rnorm(1000000)
pop_X <- pop_Z + rnorm(1000000)
pop_Y <- pop_X + rnorm(1000000)

In [18]:
data <- data.frame(x = pop_X, y = pop_Y, z = pop_Z)
summary(lm(y ~ 0 + x, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x, data = data, subset = sample)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.78270 -0.68579 -0.03951  0.64197  3.03665 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  0.98955    0.02145   46.14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9835 on 1024 degrees of freedom
Multiple R-squared:  0.6752,	Adjusted R-squared:  0.6749 
F-statistic:  2129 on 1 and 1024 DF,  p-value: < 2.2e-16


In [19]:
summary(lm(y ~ 0 + x + z, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x + z, data = data, subset = sample)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.78025 -0.68771 -0.03808  0.64100  3.03970 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  0.99487    0.03047  32.647   <2e-16 ***
z -0.01070    0.04352  -0.246    0.806    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9839 on 1023 degrees of freedom
Multiple R-squared:  0.6752,	Adjusted R-squared:  0.6746 
F-statistic:  1064 on 2 and 1023 DF,  p-value: < 2.2e-16


6. Neutral control (but lowers precision)
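The loss of precision follows from the same approximation, now with the variation in X that is left after conditioning on Z. Assuming independent standard normal noise as in In [17] and n = 1025 (inferred from the residual degrees of freedom), Var(X) = 2 but Var(X | Z) = 1, while the residual standard deviation is 1 in both models:

```latex
\text{without } Z:\ \mathrm{SE}(\hat\beta_x) \approx \frac{1}{\sqrt{1025 \cdot 2}} \approx 0.022,
\qquad
\text{with } Z:\ \mathrm{SE}(\hat\beta_x) \approx \frac{1}{\sqrt{1025 \cdot 1}} \approx 0.031
```

Controlling for Z removes half of the variation in X without explaining anything in Y, which is consistent with the reported standard errors of 0.0215 and 0.0305.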

## 7.

![](assets/dags7.png)

In [20]:
pop_X <- rnorm(1000000)
pop_M <- pop_X + rnorm(1000000)
pop_Z <- pop_M + rnorm(1000000)
pop_Y <- pop_M + rnorm(1000000)

In [21]:
data <- data.frame(x = pop_X, y = pop_Y, z = pop_Z)
summary(lm(y ~ 0 + x, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5070 -0.9222  0.0447  1.0392  3.9809 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  1.13203    0.04692   24.12   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.465 on 1024 degrees of freedom
Multiple R-squared:  0.3624,	Adjusted R-squared:  0.3618 
F-statistic:   582 on 1 and 1024 DF,  p-value: < 2.2e-16


In [22]:
summary(lm(y ~ 0 + x + z, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x + z, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2909 -0.7903  0.0268  0.8626  3.5449 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  0.57801    0.04915   11.76   <2e-16 ***
z  0.52020    0.02669   19.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.252 on 1023 degrees of freedom
Multiple R-squared:  0.535,	Adjusted R-squared:  0.5341 
F-statistic: 588.6 on 2 and 1023 DF,  p-value: < 2.2e-16


7. Bad control: introduces overcontrol bias
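The size of the overcontrol bias can be derived from the population normal equations for `y ~ 0 + x + z`. Assuming independent standard normal noise as in In [20], the implied moments are Var(X) = 1, Var(Z) = 3, Cov(X, Z) = 1, Cov(X, Y) = 1 and Cov(Z, Y) = Var(M) = 2:

```latex
\begin{pmatrix} 1 & 1 \\ 1 & 3 \end{pmatrix}
\begin{pmatrix} \beta_x \\ \beta_z \end{pmatrix}
=
\begin{pmatrix} 1 \\ 2 \end{pmatrix}
\quad\Longrightarrow\quad
\beta_x = \frac{3 \cdot 1 - 1 \cdot 2}{2} = 0.5,
\qquad
\beta_z = \frac{1 \cdot 2 - 1 \cdot 1}{2} = 0.5
```

Because Z is a noisy proxy for the mediator M, conditioning on it removes about half of the effect of X, in line with the estimate of 0.578 above.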

## 8.

![](assets/dags8.png)

In [23]:
pop_X <- rnorm(1000000)
pop_Z <- rnorm(1000000)
pop_M <- pop_X + pop_Z + rnorm(1000000)
pop_Y <- pop_M + rnorm(1000000)

In [24]:
data <- data.frame(x = pop_X, y = pop_Y, z = pop_Z)
summary(lm(y ~ 0 + x, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5743 -1.2351 -0.0665  1.1253  5.4936 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  0.96257    0.05351   17.99   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.759 on 1024 degrees of freedom
Multiple R-squared:  0.2401,	Adjusted R-squared:  0.2394 
F-statistic: 323.5 on 1 and 1024 DF,  p-value: < 2.2e-16


In [25]:
summary(lm(y ~ 0 + x + z, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x + z, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0245 -0.9978 -0.0268  0.8781  4.3578 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  0.96813    0.04361   22.20   <2e-16 ***
z  1.00996    0.04434   22.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.434 on 1023 degrees of freedom
Multiple R-squared:  0.4958,	Adjusted R-squared:  0.4948 
F-statistic:   503 on 2 and 1023 DF,  p-value: < 2.2e-16


8. Neutral control (but raises precision)

## 9.

![](assets/dags9.png)

In [26]:
pop_X <- rnorm(1000000)
pop_U <- rnorm(1000000)
pop_Y <- pop_X + pop_U + rnorm(1000000)
pop_Z <- pop_X + pop_U + rnorm(1000000)

In [27]:
data <- data.frame(x = pop_X, y = pop_Y, z = pop_Z)
summary(lm(y ~ 0 + x, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.4815 -0.9561 -0.0997  0.9979  4.1005 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  0.96361    0.04316   22.33   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.411 on 1024 degrees of freedom
Multiple R-squared:  0.3274,	Adjusted R-squared:  0.3268 
F-statistic: 498.5 on 1 and 1024 DF,  p-value: < 2.2e-16


In [28]:
summary(lm(y ~ 0 + x + z, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x + z, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3684 -0.7803 -0.0234  0.8184  3.5662 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  0.44695    0.04377   10.21   <2e-16 ***
z  0.53761    0.02570   20.92   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.182 on 1023 degrees of freedom
Multiple R-squared:  0.529,	Adjusted R-squared:  0.528 
F-statistic: 574.4 on 2 and 1023 DF,  p-value: < 2.2e-16


9. Bad control: introduces selection bias
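A minimal sketch that repeats this comparison on a larger sample, to separate the bias from sampling noise. It re-simulates the structure from In [26] with lower-case names; the seed and the sample size are arbitrary choices made here.

```r
set.seed(7)  # arbitrary seed
n <- 200000  # arbitrary large sample size

# Same structure as In [26]: X -> Y, U -> Y, X -> Z, U -> Z
x <- rnorm(n)
u <- rnorm(n)
y <- x + u + rnorm(n)
z <- x + u + rnorm(n)

b_short <- coef(lm(y ~ 0 + x))[["x"]]      # z left out: no open non-causal path
b_long  <- coef(lm(y ~ 0 + x + z))[["x"]]  # z included: collider path X -> Z <- U -> Y opened

round(c(without_z = b_short, with_z = b_long), 3)
```

Under these assumptions the coefficient without Z should stay near the true effect of 1, while the one with Z should drop to about 0.5, the same pattern as in the subsample.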

## 10.

![](assets/dags10.png)

In [29]:
pop_X <- rnorm(1000000)
pop_Y <- pop_X + rnorm(1000000)
pop_Z <- pop_Y + rnorm(1000000)

In [30]:
data <- data.frame(x = pop_X, y = pop_Y, z = pop_Z)
summary(lm(y ~ 0 + x, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x, data = data, subset = sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4401 -0.6263  0.0058  0.6410  2.9364 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  1.00072    0.03052   32.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.993 on 1024 degrees of freedom
Multiple R-squared:  0.5121,	Adjusted R-squared:  0.5116 
F-statistic:  1075 on 1 and 1024 DF,  p-value: < 2.2e-16


In [31]:
summary(lm(y ~ 0 + x + z, data = data, subset = sample))


Call:
lm(formula = y ~ 0 + x + z, data = data, subset = sample)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.05446 -0.47798 -0.01047  0.47462  2.34437 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x  0.50856    0.02691   18.90   <2e-16 ***
z  0.48700    0.01557   31.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7102 on 1023 degrees of freedom
Multiple R-squared:  0.7506,	Adjusted R-squared:  0.7502 
F-statistic:  1540 on 2 and 1023 DF,  p-value: < 2.2e-16


10. Bad control: this is called case-control bias
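The halving of the coefficient can be checked against the population normal equations for `y ~ 0 + x + z`. Assuming independent standard normal noise as in In [29], the implied moments are Var(X) = 1, Var(Z) = Var(Y) + 1 = 3, Cov(X, Z) = Cov(X, Y) = 1 and Cov(Z, Y) = Var(Y) = 2:

```latex
\begin{pmatrix} 1 & 1 \\ 1 & 3 \end{pmatrix}
\begin{pmatrix} \beta_x \\ \beta_z \end{pmatrix}
=
\begin{pmatrix} 1 \\ 2 \end{pmatrix}
\quad\Longrightarrow\quad
\beta_x = 0.5,
\qquad
\beta_z = 0.5
```

Since Z is a descendant of the outcome, conditioning on it partly conditions on Y itself, and the estimates of 0.509 and 0.487 are close to these limits.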