# Module 3: Exercise

### Contents
- [Task 1: Properly center covariates](#task-1)
- [Task 2: Estimate mean outcomes under each first-stage intervention option, as well as their mean difference](#task-2)
- [Task 3: Fit a regression model to estimate the main effect of second-stage treatment among slow-responders to JASP + EMT](#task-3)
- [Task 4: Compute sample size for a comparison of first-stage main effects](#task-4)

In [3]:
library(geepack)

## Function Definitions
The file `functions.R` contains code that will help us produce cleaner output from some of the models we'll fit in this module. Advanced R users are encouraged to look at this file to see how these functions work; otherwise, just know that this code will help us mimic SAS's estimate statements, which are used in the training slides.

In [4]:
source('functions.R')

function 'estimate' loaded successfully.


## Part 1: Getting comfortable with the data
In this series of practicum exercises, we'll be using *simulated* data in the context of the so-called autism SMART.

In [5]:
# Load data file into R
aut <- read.csv("autism-simulated-dataset.csv")

# R is case-sensitive! Avoid issues by changing variable names to all lowercase
names(aut) <- tolower(names(aut))

head(aut)

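| id | o11 | o12 | a1 | r | o21 | o22 | a2 | y |
|---|---|---|---|---|---|---|---|---|
| `<int>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` |
| 1 | 7.154994 | 25.733054 | 1 | 0 | 4.709035 | 56.01627 | -1 | 40.18308 |
| 2 | 29.735709 | 13.126062 | 1 | 0 | 4.261658 | 22.10117 | 1 | 49.41872 |
| 3 | 34.225682 | 40.074278 | 1 | 0 | 4.636997 | 62.93428 | 1 | 60.80869 |
| 4 | 30.713199 | 7.393897 | 1 | 0 | 4.40661 | 48.14041 | -1 | 56.58124 |
| 5 | 17.030857 | 27.997028 | -1 | 1 | 7.794111 | 36.41306 | NA | 63.59968 |
| 6 | 7.725391 | 8.308667 | -1 | 1 | 7.711633 | 68.09537 | NA | 71.46104 |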


We need to sort the data by ID number. Later, in Part 3, we will also create an indicator for whether or not each child was re-randomized.

In [22]:
# Sort data by ID
aut <- aut[order(aut$id), ]

As we did with the ADHD data, it will be useful to look at some summaries of the data. We'll start with the usual 5-number summaries as well as the standard deviations of each of the variables.

In [5]:
## Brief summary statistics
summary(aut)
apply(aut, 2, sd, na.rm = TRUE)  # spell out TRUE; T can be overwritten

       id              o11              o12               a1    
 Min.   :  1.00   Min.   : 3.073   Min.   : 1.028   Min.   :-1  
 1st Qu.: 50.75   1st Qu.:26.001   1st Qu.: 8.997   1st Qu.:-1  
 Median :100.50   Median :30.713   Median :17.382   Median : 0  
 Mean   :100.50   Mean   :33.502   Mean   :17.014   Mean   : 0  
 3rd Qu.:150.25   3rd Qu.:41.689   3rd Qu.:22.990   3rd Qu.: 1  
 Max.   :200.00   Max.   :79.720   Max.   :43.160   Max.   : 1  
                                                                
       r              o21              o22               a2     
 Min.   :0.000   Min.   : 1.868   Min.   : 5.527   Min.   :-1   
 1st Qu.:0.000   1st Qu.: 4.418   1st Qu.:36.293   1st Qu.:-1   
 Median :1.000   Median : 5.456   Median :48.785   Median : 0   
 Mean   :0.585   Mean   : 5.585   Mean   :48.368   Mean   : 0   
 3rd Qu.:1.000   3rd Qu.: 6.791   3rd Qu.:61.033   3rd Qu.: 1   
 Max.   :1.000   Max.   :10.752   Max.   :91.837   Max.   : 1   
                         

We'll also look at some frequency tables for first-stage treatment assignment `a1`, response status `r`, and second-stage treatment assignment `a2` cross-tabulated with `r`. Note that `a2` is defined only for children who were re-randomized, that is, slow responders to JASP + EMT (`a1 = 1` and `r = 0`). It is missing (`NA`) for all responders (`r = 1`) and for children who were assigned to receive the SGD in the first stage (`a1 = -1`).

In [6]:
## Frequency table of the initial treatment,
## early response by week 8, and second stage treatments
table(aut$a1)
table(aut$r)
with(aut, table(a2, r, useNA = "ifany")) # cross-tabulate a2 and r


 -1   1 
100 100 


  0   1 
 83 117 

      r
a2       0   1
  -1    28   0
  1     28   0
  <NA>  27 117

### <a name="task-1"></a> Task 1: Properly center covariates.
Replace the blanks with correct code to center the covariate `o21` among the entire sample, as well as just among non-responders. Press `SHIFT` + `ENTER` when done to run the code.

In [10]:
aut$o11c <- with(aut, o11 - mean(o11))
aut$o12c <- with(aut, o12 - mean(o12))
aut$o21c <- ___________
aut$o22c <- with(aut, o22 - mean(o22))

## center covariates using the mean among non-responders
aut$o11cnr <- aut$o12cnr <- NA
aut$o21cnr <- aut$o22cnr <- NA
aut$o11cnr[aut$r == 0] <- with(subset(aut, r == 0), o11 - mean(o11))
aut$o12cnr[aut$r == 0] <- with(subset(aut, r == 0), o12 - mean(o12))
aut$o21cnr[aut$r == 0] <- _______
aut$o22cnr[aut$r == 0] <- with(subset(aut, r == 0), o22 - mean(o22))

ERROR: Error in parse(text = x, srcfile = src): <text>:3:13: unexpected input
2: aut$o12c <- with(aut, o12 - mean(o12))
3: aut$o21c <- _
               ^


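When your code runs successfully, the error message above will disappear. You can check your work by running the following block of code, again by pressing `SHIFT` + `ENTER`. Both results should be `0`, up to floating-point rounding error. Because `o21cnr` is `NA` for responders, the second check needs `na.rm = TRUE`.

In [8]:
mean(aut$o21c)
mean(aut$o21cnr, na.rm = TRUE)  # o21cnr is NA for responders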

“argument is not numeric or logical: returning NA”

“argument is not numeric or logical: returning NA”
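The centering pattern from Task 1 can be tried out on a small made-up data frame first. The sketch below uses a hypothetical data frame `toy` with a covariate `x` and a response indicator `r` (neither is part of the autism data). It centers `x` over everyone, then only among rows with `r == 0`, and shows how missing values affect the check on the subgroup-centered version.

```r
# Hypothetical toy data (not the autism data): covariate x, response indicator r
toy <- data.frame(x = c(2, 4, 6, 8, 10, 12),
                  r = c(0, 0, 0, 1, 1, 1))

# Center x using the mean over the entire sample
toy$xc <- with(toy, x - mean(x))

# Center x using the mean among non-responders only (r == 0);
# responders keep NA because the centered value is not defined for them
toy$xcnr <- NA
toy$xcnr[toy$r == 0] <- with(subset(toy, r == 0), x - mean(x))

mean(toy$xc)                   # zero, up to rounding error
mean(toy$xcnr)                 # NA, because the responder rows are NA
mean(toy$xcnr, na.rm = TRUE)   # zero, up to rounding error
```

The same logic applies to any covariate: the centered variable has mean zero only within the group whose mean was subtracted.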

## Part 2: Main effect of first-stage options
We will now investigate the main effect of the first-stage intervention options, JASP + EMT (`a1 = 1`) and JASP + EMT + SGD (`a1 = -1`). We do this by fitting the regression equation

$$ E[Y\mid A_{1}, \mathbf{O}] = \beta_0 + \beta_1 A_{1} + \beta_2 O_{11c} + \beta_3 O_{12c}$$

using, as before, the `geeglm()` function in the `geepack` package. We will call this model `model1`. Notice that this model is marginal over the future: it averages over response status and second-stage treatment rather than conditioning on them.

In [None]:
library(geepack)
model1 <- geeglm(y ~ a1 + o11c + o12c, 
                id = id, data = aut)
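Quantities such as "the mean outcome under one option" are linear combinations of a model's fitted coefficients, which is what a SAS-style estimate statement reports. The sketch below illustrates the idea with base R's `lm()` on made-up data. The names `toy`, `x`, `y`, and `L` are hypothetical, the indicator is coded 0/1 rather than -1/+1, and the standard errors are model-based rather than the robust ones `geeglm()` provides. It is not the code inside `functions.R`.

```r
# Hypothetical toy data: a treatment indicator coded 0 / 1 and an outcome
toy <- data.frame(x = c(1, 1, 1, 0, 0, 0),
                  y = c(10, 12, 14, 5, 7, 9))

fit <- lm(y ~ x, data = toy)

# Each row is a contrast: weights applied to (intercept, slope on x)
L <- rbind("Mean y at x = 1" = c(1, 1),
           "Mean y at x = 0" = c(1, 0),
           "Difference"      = c(0, 1))

est <- drop(L %*% coef(fit))                  # linear combinations of coefficients
se  <- sqrt(diag(L %*% vcov(fit) %*% t(L)))   # model-based standard errors

cbind(Estimate = est, SE = se)
```

With a different coding of the indicator (such as -1/+1), the weights in each row change accordingly, so it is worth writing out the regression equation at each treatment value before choosing them.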

### <a name="task-2"></a> Task 2: Estimate mean outcomes under each first-stage intervention option, as well as their mean difference.

*Note that you will not be able to complete this task until you have successfully completed [Task 1](#task-1).*

Using `model1`, we want to estimate the mean outcome $Y$ under each of the first-stage intervention options. To do this, fill in the blanks to complete the second argument to `estimate()`. Once you have filled in the blanks, press `SHIFT` + `ENTER` to run the code.

In [9]:
estimate(model1,
         rbind("Mean Y under JASP+EMT"     = c(1, ____, 0, ____),
               "Mean Y under JASP+EMT+SGD" = c(1, -1, ____, 0),
               "Between groups diff"       = c(0, ____, 0, ____)))

ERROR: Error in parse(text = x, srcfile = src): <text>:2:51: unexpected input
1: estimate(model1,
2:          rbind("Mean Y under JASP+EMT"     = c(1, _
                                                     ^


If you have successfully run the code, the estimated standard error of the between-groups difference in means should be 2.0495.

Double-click the following block of text to edit it, delete the existing contents, and describe the results of the hypothesis test $H_0: 2\beta_1 = 0$ vs $H_1: 2\beta_1 \neq 0$, where $2\beta_1$ is the between-groups difference in mean outcomes. Is there evidence that adding the speech device in the first stage leads to better outcomes in children with autism, on average? As above, press `SHIFT` + `ENTER` when done.

**(Double-click to edit, and describe the results of the hypothesis test)** 

## Part 3: Main effect of second-stage options / tactics
Now we investigate the main effect of the second-stage tactics among slow responders to JASP + EMT, the children who were re-randomized. In particular, we seek to answer the question *"Among children who respond slowly to JASP + EMT, is it better, on average, to intensify JASP + EMT or to augment with AAC, adjusting for covariates?"* We will address this question by fitting the following regression model:

$$ E[Y \mid A_2, \mathbf{O_{1}}, \mathbf{O_{2}}, S = 1] = \beta_0 + \beta_1 A_{2} + \beta_2 O_{11c} + \beta_3 O_{12c} + \beta_4 O_{21c} + \beta_5 O_{22c}.$$

In order to fit this regression, we need an indicator $S$ for whether or not a child was re-randomized (1 = re-randomized, 0 = otherwise). This is essentially an indicator for whether the child was a slow-responder to JASP + EMT. We need this so we can tell `geeglm()` we only want to perform the regression on this subset of observations.

In [21]:
aut$s <- ifelse(aut$a1 == 1 & aut$r == 0, 1, 0)
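A quick sanity check on an indicator like `s` is to confirm that it flags exactly the rows where the second-stage assignment is observed. The sketch below does this on a hypothetical data frame `toy` that mimics the coding of `a1`, `r`, and `a2`; the same `table()` call could be run on `aut`.

```r
# Hypothetical toy data mimicking the coding of a1, r, and a2
toy <- data.frame(a1 = c( 1,  1,  1, -1, -1),
                  r  = c( 0,  0,  1,  0,  1),
                  a2 = c(-1,  1, NA, NA, NA))

# Same construction as above: 1 = re-randomized, 0 = otherwise
toy$s <- ifelse(toy$a1 == 1 & toy$r == 0, 1, 0)

# Rows with s = 1 should all have a2 observed; rows with s = 0 should all be NA
table(s = toy$s, a2_missing = is.na(toy$a2))
```

If the off-diagonal cells of this table are not zero, the indicator and the recorded second-stage assignments disagree and the subset passed to the model would be wrong.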

### <a name="task-3"></a> Task 3: Fit a regression model to estimate the main effect of second-stage treatment among slow-responders to JASP + EMT

*Note that you will not be able to complete this task until you have successfully completed [Task 1](#task-1).*

Your task is to translate the above regression model into R code. Fill in the blanks in the code below, then press `SHIFT` + `ENTER` to run the code when you're done. The `coefs()` function will return the coefficient estimates (but not standard errors -- we need our `estimate()` function to do that) from your model. If you've fit the correct model, the coefficient associated with $A_2$ will be **-4.2471**.

In [1]:
model2 <- geeglm(y ~ _____________________________________________, 
                id = id, data = aut,
                subset = s == ______)
coefs(model2)

ERROR: Error in parse(text = x, srcfile = src): <text>:1:22: unexpected input
1: model2 <- geeglm(y ~ _
                         ^


## <a name="part-d"></a>Part 4: Sample Size for Primary Aims involving Main Effect Comparisons

Recall the formula for the sample size for a SMART in which the primary aim is to compare the main effects of first-stage interventions, using an asymptotic z-test:
$$ n \geq \frac{4 (z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}. $$
We saw in the [Module 3 Tutorial](module3tutorial.ipynb#part-d) how to use `power.t.test` to find sample sizes for these comparisons.
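The formula above can also be evaluated directly with `qnorm()`. In it, $\beta$ is the type II error rate (so $1 - \beta$ is the power), not a regression coefficient, and $\delta$ is the standardized effect size. The sketch below uses illustrative inputs chosen for this example (the variable names are hypothetical). Because it relies on the normal approximation, its answer will generally be slightly smaller than a t-based calculation.

```r
# Illustrative inputs (assumed values for this sketch)
alpha <- 0.05   # two-sided type I error rate
power <- 0.80   # 1 - beta, where beta is the type II error rate
delta <- 0.5    # standardized effect size

# Total sample size from the z-test formula: 4 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2
n_total <- 4 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / delta^2

ceiling(n_total)   # round up to the next whole participant: 126
```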

### <a name="task-4"></a> Task 4: Compute sample size for a comparison of first-stage main effects
Use `power.t.test` to compute the sample size for a trial similar to the Autism SMART in which the primary aim is a comparison of the main effects of first-stage interventions. Power the study to detect an effect size of $\delta = 0.4$ at 80% power, using a two-sided significance level of 5%. 

In [None]:
power.t.test(_______)

The total required sample size of the trial is 200.