# Practicum 2: Using Data from a SMART to Address Primary Aims about Embedded Adaptive Interventions


<br>
<font size=3>
    This material has been developed for [Getting SMART About Adaptive Interventions in Education](https://d3lab.isr.umich.edu/training/) led by [d3lab](https://d3lab.isr.umich.edu). 
    
    Notebooks were developed by [Nicholas J. Seewald](https://nickseewald.com). 
    SAS code originally written by Daniel Almirall, Inbal Nahum-Shani, and Susan A. Murphy.
    The code was translated into R by Audrey Boruvka and Nicholas J. Seewald.
</font>


### Practicum Tasks
- [Task 1: Create an indicator for whether an individual is consistent with (JASP+EMT, INTENSIFY)](#task-1)
- [Task 2: Create weights](#task-2)
- [Task 3: Create indicator variables for consistency with the AIs under study](#task-3)
- [Task 4: Replicate responders to JASP+EMT](#task-4)
- [Task 5: Perform weighted and replicated regression](#task-5)

<hr>

In this series of practicum exercises, we'll use *simulated* data from the so-called autism SMART:
<img src="assets/autism-smart-diagram.jpg" alt="Autism SMART diagram" style="width: 500px;"/>

**First-Stage Coding**:
- JASP+EMT: A1 = 1
- JASP+EMT+SGD: A1 = -1

**Second-Stage Coding**:
- ADD SGD: A2 = 1 
- INTENSIFY: A2 = -1


## Function Definitions
The file `functions.R` contains code that helps produce cleaner output from some of the models we'll fit in this module. Advanced R users are encouraged to look at this file to see how these functions work; otherwise, just know that this code lets us mimic SAS's estimate statements, which are used in the training slides.

In [2]:
library(geepack)
source('functions.R')

function 'estimate' loaded successfully.


As in the [Main Effects Practicum](01_MainEffects_Practicum.ipynb), we need to do some data management before we can get started. See that notebook for more details; here, just run the cell below to perform all necessary operations.

In [7]:
aut <- read.csv("assets/autism-simulated-dataset.csv")
names(aut) <- tolower(names(aut))
aut <- aut[order(aut$id), ]

aut$o11c <- with(aut, o11 - mean(o11))
aut$o12c <- with(aut, o12 - mean(o12))
aut$o21c <- with(aut, o21 - mean(o21))
aut$o22c <- with(aut, o22 - mean(o22))
aut$o11cnr <- aut$o12cnr <- NA
aut$o21cnr <- aut$o22cnr <- NA
aut$o11cnr[aut$r == 0] <- with(subset(aut, r == 0), o11 - mean(o11))
aut$o12cnr[aut$r == 0] <- with(subset(aut, r == 0), o12 - mean(o12))
aut$o21cnr[aut$r == 0] <- with(subset(aut, r == 0), o21 - mean(o21))
aut$o22cnr[aut$r == 0] <- with(subset(aut, r == 0), o22 - mean(o22))

aut$s <- ifelse(aut$a1 == 1 & aut$r == 0, 1, 0)

aut <- aut[order(aut$id), ]
head(aut)

id,o11,o12,a1,r,o21,o22,a2,y,o11c,o12c,o21c,o22c,o12cnr,o11cnr,o22cnr,o21cnr,s
<int>,<dbl>,<dbl>,<int>,<int>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,7.154994,25.733054,1,0,4.709035,56.01627,-1.0,40.18308,-26.3466318,8.719485,-0.8759501,7.6479446,7.714571,-27.7192669,7.84296836,0.7102475,1
2,29.735709,13.126062,1,0,4.261658,22.10117,1.0,49.41872,-3.7659167,-3.887507,-1.3233268,-26.2671614,-4.892421,-5.1385519,-26.0721377,0.2628708,1
3,34.225682,40.074278,1,0,4.636997,62.93428,1.0,60.80869,0.7240556,23.060708,-0.9479882,14.565954,22.055794,-0.6485796,14.76097768,0.6382094,1
4,30.713199,7.393897,1,0,4.40661,48.14041,-1.0,56.58124,-2.7884268,-9.619673,-1.1783753,-0.2279138,-10.624586,-4.1610619,-0.03289008,0.4078223,1
5,17.030857,27.997028,-1,1,7.794111,36.41306,,63.59968,-16.4707692,10.983458,2.2091255,-11.9552628,,,,,0
6,7.725391,8.308667,-1,1,7.711633,68.09537,,71.46104,-25.7762347,-8.704902,2.126648,19.7270451,,,,,0


## Part 1: Estimate the mean outcome under an embedded AI

We'll start by creating an indicator for the (JASP+EMT, INTENSIFY) adaptive intervention. The indicator, which we'll call $Z_1$, is defined as 
$$
Z_1 = \left\{ 
\begin{array}{lr}
    1  & \text{Individual consistent with (JASP+EMT, INTENSIFY)} \\
    -1 & \text{otherwise}
\end{array}
\right. .
$$

### <a name="task-1"></a> Task 1: Create an indicator for whether an individual is consistent with (JASP+EMT, INTENSIFY)
Below, we start code to create the indicator $Z_1$ described above. Fill in the blanks to finish the code.

In [4]:
aut$z1 <- -1
# responders to JASP+EMT are consistent with (JASP+EMT, INTENSIFY)
aut$z1[aut$a1 == 1 & aut$r == 1] <- 1
# non-responders to JASP+EMT who receive INTENSIFY are consistent
aut$z1[aut$a1 == _____ & aut$r == _____ & aut$a2 == _____] <- 1

table(aut$z1)



When you are done, keep your cursor in the above cell and press `SHIFT`+`ENTER`. The table should show that **72** children are consistent with (JASP+EMT, INTENSIFY) (i.e., there are 72 1's in the table).

### <a name="task-2"></a> Task 2: Create weights
To estimate the mean outcome under (JASP+EMT, INTENSIFY), we need weights that account for the imbalance (by design) in the numbers of responders and slow responders who are consistent with this AI.

Remember that a responder is consistent with the AI(s) beginning with their first-stage treatment with probability 1/2, because only the first-stage randomization determines their treatment sequence. Likewise, a slow responder to JASP+EMT+SGD is consistent with the single AI that begins with that intervention with probability 1/2. Slow responders to JASP+EMT are randomized a second time, so each is consistent with a given AI beginning with JASP+EMT with probability 1/2 × 1/2 = 1/4. The weights are the inverses of these probabilities: we weight slow responders to JASP+EMT by 4, and all other children by 2.
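The arithmetic behind these weights can be written out as follows. This is a sketch that assumes each randomization assigns its two options with equal probability, as the 1/2 and 1/4 figures above imply; the symbol $W$ for the weight is introduced here for illustration.

$$
W = \frac{1}{P(A_1) \, P(A_2 \mid A_1, R)}
$$

$$
\begin{array}{lll}
\text{Responders (either first-stage treatment):} & P = \frac{1}{2} \times 1 = \frac{1}{2} & W = 2 \\
\text{Slow responders to JASP+EMT+SGD:} & P = \frac{1}{2} \times 1 = \frac{1}{2} & W = 2 \\
\text{Slow responders to JASP+EMT:} & P = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} & W = 4
\end{array}
$$

The second factor equals 1 whenever a child is not re-randomized, which is why only slow responders to JASP+EMT receive the larger weight.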

Below, you'll create the weight variable, called `w`.

In [None]:
# Start by giving everyone a weight of 2
aut$w <- 2

# Give slow responders to JASP+EMT (A1 = 1) an appropriate weight
aut$w[aut$a1 == _____ & aut$r == _____] <- _____

table(aut$w)

When you've filled in the blanks above, keep your cursor in the cell and press `SHIFT`+ `ENTER` to run the code. If you've completed the task successfully, there will be **56** children with a weight of 4.

### Modeling
*You will need to have completed Tasks 1 and 2 to run the code below, since the model uses both `z1` and `w`.*

In [20]:
## Run weighted regression

model3 <- geeglm(y ~ z1, weights = w, id = id, data = aut)

estimate(model3,
         rbind("Mean Y under AI #1 (JASP+EMT, INTENSIFY)" = c(1, 1)))



An alternative way to estimate the mean under (JASP+EMT, INTENSIFY) is to restrict the analysis to children with `z1 == 1` and estimate a weighted mean (i.e., fit an intercept-only model).

In [None]:
model3alternative <- geeglm(y ~ 1, weights = w, id = id, data = aut,
                            subset = z1 == 1)

summary(model3alternative)
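To see why the two approaches target the same quantity, here is a minimal sketch on made-up toy data (the values and the data frame `toy` are hypothetical, not the practicum dataset). It uses base R's `lm()` instead of `geeglm()` so that it runs without extra packages. With an independence working correlation, the point estimates should match, although `lm()` does not give the robust standard errors that `geeglm()` reports.

```r
# Toy data (hypothetical values, not the practicum dataset)
toy <- data.frame(
  y = c(10, 12, 9, 15, 20, 18),
  z = c(1, 1, 1, -1, -1, -1),   # +/-1 consistency indicator
  w = c(2, 4, 4, 2, 2, 4)       # weights
)

# Approach 1: regress y on the +/-1 indicator, then apply the contrast c(1, 1)
fit_full <- lm(y ~ z, data = toy, weights = w)
mean_contrast <- sum(c(1, 1) * coef(fit_full))

# Approach 2: intercept-only weighted regression among rows with z == 1
fit_sub <- lm(y ~ 1, data = toy, weights = w, subset = z == 1)
mean_subset <- unname(coef(fit_sub)[1])

# Approach 3: the weighted mean computed directly
mean_direct <- with(subset(toy, z == 1), weighted.mean(y, w))

print(c(contrast = mean_contrast, subset = mean_subset, direct = mean_direct))
```

All three values agree (10.4 for these toy numbers), because a weighted regression on a single two-level indicator reproduces the weighted mean within each level.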

## Part 2: Compare the means of two embedded adaptive interventions
We are now going to compare the mean outcomes had every child been consistent with (JASP+EMT+SGD, INTENSIFY) to the mean outcomes had every child been consistent with (JASP+EMT, Add SGD). The goal is to do this simultaneously (i.e., with one regression). This also facilitates making inferences about the difference in means.

Below, we use an intuitive (but less efficient) way to compare these two adaptive interventions. In the regression below, we'll use data only from participants who are consistent with one of the two AIs we're comparing.

### <a name="task-3"></a> Task 3: Create indicator variables for consistency with the AIs under study
To perform this single regression to compare mean outcomes under (JASP+EMT+SGD, INTENSIFY) and (JASP+EMT, Add SGD), we need to create indicator variables for whether or not each child was consistent with the appropriate AI.

**Notice: we can identify children who were consistent with (JASP+EMT+SGD, INTENSIFY) using only their first-stage treatment!**

In [9]:
# Create indicator z2 for consistency with (JASP+EMT, Add SGD)
## Give everyone -1 to start with (not consistent)
aut$z2 <- -1
## Change indicator to 1 if consistent
aut$z2[aut$a1 == 1 & aut$r == 1] <- 1
aut$z2[aut$a1 == 1 & aut$r == 0 & aut$a2 == 1] <- 1

# Create indicator z3 for consistency with (JASP+EMT+SGD, INTENSIFY)
## Give everyone -1 to start with (not consistent)
aut$z3 <- -1
## Change indicator to 1 if consistent
aut$z3[________] <- 1

table(aut$z3)



 -1   1 
100 100 

When you've filled in the above blank, keep your cursor in the cell and press `SHIFT` + `ENTER` to run the code. If you've done this correctly, you will find that **100** children are consistent with (JASP+EMT+SGD, INTENSIFY) (i.e., have `z3` = 1). 

### Model for comparison of (JASP+EMT+SGD, INTENSIFY) and (JASP+EMT, Add SGD)
Below, we fit the models used to make the comparison of interest.
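The contrast vectors passed to `estimate` in this part follow from the ±1 coding of `z2`. As a sketch, assume `estimate` applies each row as a linear combination of the fitted coefficients (an assumption about `functions.R`, which is not shown here). In the subset being analyzed, $Z_2 = 1$ identifies children consistent with (JASP+EMT, Add SGD) and $Z_2 = -1$ identifies children consistent with (JASP+EMT+SGD, INTENSIFY), so the model $E[Y \mid Z_2] = \beta_0 + \beta_1 Z_2$ gives

$$
\begin{array}{lcll}
E[Y \mid Z_2 = 1] & = & \beta_0 + \beta_1 & \text{row } (1, 1) \\
E[Y \mid Z_2 = -1] & = & \beta_0 - \beta_1 & \text{row } (1, -1) \\
E[Y \mid Z_2 = 1] - E[Y \mid Z_2 = -1] & = & 2\beta_1 & \text{row } (0, 2)
\end{array}
$$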

In [10]:
model5 <- geeglm(y ~ z2, weights = w, id = id, data = aut,
                 # Only use individuals for whom z2 OR z3 is 1
                 subset = z2 == 1 | z3 == 1)

estimate(model5,
         rbind("Mean Y under AI#1 (JASP+EMT, Add SGD)"       = c(1,  1),
               "Mean Y under AI#2 (JASP+EMT+SGD, INTENSIFY)" = c(1, -1),
               "Difference AI#1 - AI#2"                      = c(0,  2)))

## Conduct a regression analysis to compare mean outcomes under AI#1 vs AI#2,
## adjusting for baseline covariates o11c and o12c

model6 <- geeglm(y ~ z2 + o11c + o12c, weights = w, id = id, data = aut,
                 # Only use individuals for whom z2 OR z3 is 1
                 subset = z2 == 1 | z3 == 1)
estimate(model6,
         rbind("Mean Y under AI#1 (JASP+EMT, Add SGD)"       = c(1,  1, 0, 0),
               "Mean Y under AI#2 (JASP+EMT+SGD, INTENSIFY)" = c(1, -1, 0, 0),
               "Difference AI#1 - AI#2"                      = c(0,  2, 0, 0)))



## Part 3: Compare all three embedded adaptive interventions with one regression
Now we'll estimate the mean outcomes had all children been consistent with each of the three embedded adaptive interventions, using a single regression. This analysis differs from the above in that it requires both weighting *and* replication.

Because responders to JASP+EMT are consistent with more than one adaptive intervention -- (JASP+EMT, INTENSIFY) and (JASP+EMT, Add SGD) -- we need to trick R into using their data to estimate the mean outcomes for both of those AIs. Responders to JASP+EMT+SGD do *not* need to be replicated, as they're consistent with only one AI.

We're going to create a new data.frame called `aut.rep` for the replicated data.

### <a name="task-4"></a> Task 4: Replicate responders to JASP+EMT

In [12]:
# Create aut.rep, which contains one copy of the original data, giving this copy obs = 1
aut.rep <- rbind(data.frame(aut, obs = 1),
                 # Add a copy of the rows to be replicated, giving this copy obs = 2
                 # Fill in the blank with the value of a1 (first-stage treatment) whose responders are replicated
                 data.frame(subset(aut, a1 == ______ & r == 1), obs = 2))
nrow(aut.rep)

When you're finished, press `SHIFT` + `ENTER` to run the code, as usual. The replicated data should have **244** rows.

We now need to fix the indicator for second-stage treatment `a2` such that one copy of the responders to JASP+EMT receives $A_2 = 1$ and one copy receives $A_2 = -1$. 

In [19]:
# In the first blank, fill in the value of r identifying the rows we want to manipulate
# In the second blank, fill in the value of a1 identifying the rows we want to manipulate.
aut.rep$a2[aut.rep$a1 == 1 & aut.rep$r == _____] <-
  c(-1, 1)[with(aut.rep, obs[a1 == _____ & r == 1])]

## NOTE: geeglm needs data to be sorted by cluster id 
## (here a "cluster" is a single person in the SMART)
aut.rep <- aut.rep[order(aut.rep$id), ]

## Cross-tabulate R and A2 in the original data
with(aut, table(r, a2, useNA = "ifany"))

## For comparison, cross-tabulate R and A2 in the replicated data
with(aut.rep, table(r, a2))

   a2
r    -1   1 <NA>
  0  28  28   27
  1   0   0  117

   a2
r   -1  1
  0 28 28
  1 44 44

When you're done filling in the blanks, press `SHIFT`+`ENTER`. You should see two tables: the first cross-tabulates R and A2 in the original (unreplicated) data; the second does the same for the replicated data. The second table should show **44** responders under each of the second-stage interventions. It shows no `NA`s only because this call to `table()` omits `useNA`; children who started on JASP+EMT+SGD still have a missing `a2`, which we handle next.
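Two base R behaviors do the work in the cell above: indexing a length-2 vector by the copy label, and `table()` dropping missing values by default. A minimal sketch with made-up vectors (the names `obs_toy`, `a2_new`, and `a2_with_na` are hypothetical):

```r
# Hypothetical copy labels for five replicated rows (1 = first copy, 2 = second copy)
obs_toy <- c(1, 2, 1, 2, 2)

# Indexing c(-1, 1) by the copy label maps copy 1 to -1 and copy 2 to 1
a2_new <- c(-1, 1)[obs_toy]
print(a2_new)

# table() silently drops NA unless useNA is set
a2_with_na <- c(-1, 1, NA, NA, 1)
print(table(a2_with_na))                   # counts only the three non-missing values
print(table(a2_with_na, useNA = "ifany"))  # adds an <NA> column
```

Indexing this way assigns both copies in one step, without a loop or a pair of separate assignments.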

We now need indicators that distinguish all three embedded adaptive interventions. `a1` separates (JASP+EMT, INTENSIFY) and (JASP+EMT, Add SGD) from (JASP+EMT+SGD, INTENSIFY), but among children with `a1 == 1` we still need to tell (JASP+EMT, INTENSIFY) apart from (JASP+EMT, Add SGD).

In [None]:
# Interaction between a1 and a2 will help differentiate between these two AIs
aut.rep$a1a2 <- aut.rep$a1 * aut.rep$a2

# Set the interaction variable to 0 for those rows for which it doesn't exist (i.e., A1 = -1 -- children starting on JASP+EMT+SGD)
aut.rep$a1a2[is.na(aut.rep$a1a2)] <- 0

So now we're able to distinguish between our three AIs using `a1` and `a1a2`. Now we can model!

### <a name="task-5"></a> Task 5: Perform weighted and replicated regression
Below, fill in the blanks to perform a weighted and replicated regression using `aut.rep` which will allow us to simultaneously estimate the mean outcomes had all children been consistent with each embedded adaptive intervention.

In [21]:
model7 <- geeglm(y ~ ____ + a1a2, weights = w, id = id, data = aut.rep)

estimate(model7,
         rbind(
           # These statements get the mean under each AI
           # Remember the coefficients are positional: the first is the intercept, the second is the variable in the blank, the third is a1a2.
           # Among children starting on JASP+EMT, a1a2 is -1 for INTENSIFY (A2 = -1) and 1 for Add SGD (A2 = 1).
           "Mean Y: AI#1 (JASP+EMT, INTENSIFY)" = c(1, ____, -1),
           "Mean Y: AI#2 (JASP+EMT, Add SGD)"   = c(1, ____,  1),
           "Mean Y: AI#3 (JASP+EMT+SGD, ...)"   = c(1, ____,  0),
           "Diff: (JASP+EMT, Add SGD) - (JASP+EMT+SGD, ...)" = c(0, ____,  1)))



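If you have filled in the blanks successfully, the cell will run without error and `estimate` will report the estimated mean outcome under each of the three embedded AIs, along with the estimated difference between (JASP+EMT, Add SGD) and (JASP+EMT+SGD, INTENSIFY).

Double-click the blank cell below to edit, and describe your conclusion regarding the difference in mean outcomes between (JASP+EMT, Add SGD) and (JASP+EMT+SGD, INTENSIFY).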

**Double-click to edit. Replace this text with a description of your conclusion.**

A statistical advantage of estimating the means under all three embedded AIs simultaneously is that we can increase statistical efficiency (i.e., obtain smaller standard errors) when estimating differences in means. We do this by adjusting for baseline covariates that explain variability in Y. However, we must be careful not to adjust for post-baseline (time-varying) covariates or intermediate outcomes.
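The efficiency gain can be illustrated with a small simulation. This is a sketch on simulated data with hypothetical names (`z`, `x`, `xc`) and made-up effect sizes, using `lm()` and its model-based standard errors instead of the weighted `geeglm()` fit from the practicum.

```r
set.seed(2024)  # arbitrary seed, for reproducibility only

n  <- 200
z  <- rep(c(-1, 1), each = n / 2)     # +/-1 indicator, as in the practicum
x  <- rnorm(n)                        # hypothetical baseline covariate
xc <- x - mean(x)                     # centered version of the covariate
y  <- 50 + 2 * z + 5 * xc + rnorm(n)  # outcome strongly related to the covariate

fit_unadj <- lm(y ~ z)       # no covariate adjustment
fit_adj   <- lm(y ~ z + xc)  # adjusts for the baseline covariate

se_unadj <- summary(fit_unadj)$coefficients["z", "Std. Error"]
se_adj   <- summary(fit_adj)$coefficients["z", "Std. Error"]

print(c(unadjusted = se_unadj, adjusted = se_adj))
```

The adjusted standard error for `z` is smaller because the covariate absorbs outcome variability that would otherwise be left in the residuals. The size of the gain depends on how strongly the covariate predicts the outcome.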

Let's build a model which adjusts for `o11c`, the centered version of baseline covariate `o11`.

In [None]:
# Fill in the blank with the other variable needed to identify all three embedded AIs
model8 <- geeglm(y ~ a1 + _____ + o11c, weights = w, id = id,
                 data = aut.rep)

estimate(model8,
         rbind(
             # These statements get the mean under each AI
             # Remember the coefficients are positional: the first is the intercept, the second is a1, the third is the variable in the blank, the fourth is o11c.
             "Mean Y: AI#1 (JASP+EMT, INTENSIFY)"     = c(1, ____, -1, 0),
             "Mean Y: AI#2 (JASP+EMT, Add SGD)"       = c(1, ____,  1, 0),
             "Mean Y: AI#3 (JASP+EMT+SGD, INTENSIFY)" = c(1, ____,  0, 0),
             "Diff: (JASP+EMT, Add SGD) - (JASP+EMT+SGD, INTENSIFY)" = c(0, ____,  1, 0)
         ))

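Press `SHIFT`+`ENTER` to run the code. If you've filled in the blanks successfully, the cell will run without error and `estimate` will report the covariate-adjusted estimates of the mean outcome under each embedded AI and of the difference between (JASP+EMT, Add SGD) and (JASP+EMT+SGD, INTENSIFY).

Double-click the blank cell below to edit, and describe your conclusion regarding the difference in mean outcomes between (JASP+EMT, Add SGD) and (JASP+EMT+SGD, INTENSIFY). Compare this conclusion to the one you reached using the unadjusted model above.

**Double-click to edit. Replace this text with a description of your conclusion.**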