# How to handle (one-sided) noncompliance

Ever run an AB test in which not all users in the treatment group actually received the treatment?  As in, you pull a sample from your user base, you split it into treatment and control, you intend to give the treatment to all of the users in the treatment group, and give no treatment to users in the control group, but for one reason or another, not all users in the treatment group actually received the treatment.  

If you've never had to worry about this, great!  But you might be thinking, how would this even happen?  Well, imagine you work on an app or a webpage and you and a team want to test a new feature.  This feature is not in plain view as soon as a user visits your page, so the user would have to navigate in order to see it.  Now, when designing the experiment you pull a sample of users, and you randomly divide this sample into A and B, control and treatment.  All that is left to do is to wait for the users to log in to the webpage and, hopefully, navigate to the section of your webpage with the test feature.  But, because the feature is initially out of view, only some users in the treatment group will ever navigate to the point on the page where the feature is located.  In other words, some of your treatment group will remain untreated because they will never reach that section of the page to see your test feature.  What do you do?  Luckily, this type of issue is common in randomized controlled trials, where it is known as *noncompliance*.  For example, medical researchers who are testing a new drug cannot always *force* an individual to take their medication, so inevitably some participants in the experimental group will not comply.  

This post focuses on one-sided noncompliance -- when you assign some users to a treatment group, but some of them manage to remain untreated.  Two-sided noncompliance also exists -- when some users in the control group manage to get access to the experimental treatment -- but in the kind of AB test that a data scientist would be involved in, it is much less likely to occur.  In any case, the analysis is the same, but the assumptions become slightly stronger in the two-sided case.  

In [1]:
options(tidyverse.quiet=TRUE)
options(dplyr.summarise.inform=FALSE)
options(warn=-1)
library(tidyverse)
suppressMessages(library(AER))
library(broom)

# AER for ivreg function, 
# you can also find ivreg functions in the ivreg package, and tsls in the sem and gmm packages... 
# There are probably several others as well that I'm not aware of
set.seed(123)

## Potential outcomes

The potential outcomes framework provides the theoretical justification for randomized controlled experiments as a means of determining causality.  In effect, it underpins all AB testing.  We're going to use it here to demonstrate the effects of noncompliance, and how we can still obtain valid causal estimates in the presence of noncompliance.  

Before we run any AB test we can think of each user in our sample as having two theoretical states; the user can be assigned to the treatment group, in which case they will be given the experimental treatment, or they can be assigned to the control group, in which case they are not given the treatment.  Each individual in the sample therefore has a 'treated state' and an 'untreated state'.  Naturally, one can never observe an individual in both the treated state and untreated state at the same time (this is known as the fundamental problem of causal inference).  But, given random assignment, we can aggregate and observe groups in both states simultaneously.  We're going to use this framework in the simulations throughout this post.  

Let's get theoretical now and assume an all-knowing position in which we *can observe* all users in a sample in both states.  We can create a dataset that has exactly this.  First, let's define a treatment status as a random, binary variable -- it will be 0 for ~50% of the sample and 1 for ~50% of the sample:

In [2]:
treatment_status = rbinom(1000, 1, 0.5)

Next, let's define the baseline value of the outcome variable for all users.  This is the status quo: if we left all users in the control condition, these are the values that we would have:

In [3]:
y_under_control = rnorm(1000, 10, 1)

Now, we can define a treatment effect.  Let's assume this is a normally distributed variable with a mean of 2 and a standard deviation of 0.1.  This means that, for some users the treatment effect will be larger than 2, for some it will be smaller than 2, but on average it will be 2, give or take:

In [4]:
treatment_effect = rnorm(1000, 2, 0.1)

Given this treatment effect, we can obtain the treated state for each user by adding the individual treatment effects to the baseline values for each user as assumed under control:

In [5]:
y_under_treatment = y_under_control + treatment_effect

Now, let's pretend that we actually ran an AB test with this sample.  Which values would we observe?  For the users for whom treatment status was 0, we would observe ```y_under_control```, and for users for whom treatment status was 1, we would observe ```y_under_treatment```.  Given this, we can define a third y variable, ```y_observed```, which represents the actual values that we observed in this theoretical AB test:

In [6]:
y_observed = y_under_control + treatment_effect*treatment_status
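
In symbols (the notation is introduced only for this aside), writing $y_i(0)$ for ```y_under_control```, $\tau_i$ for ```treatment_effect``` and $z_i$ for ```treatment_status```, the line above is:

<h3><center>$ y_i^{obs} = y_i(0) + \tau_i z_i $</center></h3>

so that $y_i^{obs} = y_i(0)$ when $z_i = 0$, and $y_i^{obs} = y_i(0) + \tau_i = y_i(1)$ when $z_i = 1$.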

Now that we have both of our treatment states for all users, as well as the actually observed values from our hypothetical AB test, we can arrange this all into a data frame:

In [7]:
data = data.frame(
    treatment_status,
    treatment_effect,
    y_under_control,
    y_under_treatment,
    y_observed
)

data %>% head()

treatment_status,treatment_effect,y_under_control,y_under_treatment,y_observed
<int>,<dbl>,<dbl>,<dbl>,<dbl>
0,1.917901,9.398107,11.31601,9.398107
1,1.969274,9.006301,10.97558,10.975576
0,1.90979,11.026785,12.93658,11.026785
1,2.062707,10.751061,12.81377,12.813768
1,2.112036,8.490833,10.60287,10.602869
0,2.212721,9.904853,12.11757,9.904853


Here it is, we have a sample of 1000 users.  For each user we have a treatment status, which designates if a user was assigned to the control condition (0) or treatment (1); we have a hypothetical treatment effect **for each user** which assigns a treatment effect that each user in the sample would have had if they were assigned to the treatment group; we have the y value for each user in the untreated state, the untreated potential outcome; we have the y value for each user in the treated state, the treated potential outcome; and we have the y value that we would in fact observe when running this AB test given the treatment assignment for each user.  

## Average Causal Effect



Before we get into noncompliance, let's take a look at the most commonly used treatment effect, the average causal effect (also commonly called the average treatment effect).  This is simply the difference in observed means between treatment and control.  

<h3><center> $ ACE = \bar{y}_{treatment} - \bar{y}_{control} $ </center></h3>

In [8]:
data %>%
  group_by(treatment_status) %>%
  summarise(y_observed = mean(y_observed))

treatment_status,y_observed
<int>,<dbl>
0,10.01029
1,12.01041


Because assignment to treatment was random, the observed difference in means should match the difference in mean potential outcomes for the entire sample.  In other words, the mean of the treated potential outcome for all users should be roughly 2 units greater than the mean of the untreated potential outcome for all users:

In [9]:
data %>%
  summarise(y_under_control = mean(y_under_control),
           y_under_treatment = mean(y_under_treatment))

y_under_control,y_under_treatment
<dbl>,<dbl>
10.01193,12.01167


In effect, the ACE is an unbiased estimate of the difference between the mean treated potential outcome and the mean untreated potential outcome (remember, these two values are inherently unobservable).  Ok, now that that's clear, let's see what happens when we introduce noncompliance into the sample. 
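
Before moving on, here is a minimal sketch of how the ACE and its standard error could be obtained in one step with a regression of the outcome on treatment assignment.  The short variable names (```z```, ```y0```, ```tau```, ```y```) and the seed are choices made for this sketch so that the block runs on its own; they mirror the simulation above but are not the same draws.

```r
# Re-create a small version of the simulation so the block runs on its own
set.seed(42)                  # arbitrary seed, chosen for this sketch
n = 1000
z = rbinom(n, 1, 0.5)         # treatment assignment
y0 = rnorm(n, 10, 1)          # untreated potential outcome
tau = rnorm(n, 2, 0.1)        # individual treatment effects
y = y0 + tau * z              # observed outcome under full compliance

# With a single binary regressor, the coefficient on z equals the
# difference in group means, i.e. the ACE
fit = lm(y ~ z)
ace = unname(coef(fit)["z"])
ace_se = summary(fit)$coefficients["z", "Std. Error"]

c(ace = ace, se = ace_se)
```

The regression form is convenient because the standard error comes along with the estimate; a two-sample t-test would serve the same purpose.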

## Noncompliance - and different causal effects

We'll add one more variable, compliance, to our data frame.  For now, let's make this variable uncorrelated with all others.  We'll define it as a simple binary variable: 0 indicates that a user is a noncomplier, 1 indicates that a user is a complier.  We will define this for all users, so even users that are assigned to the control group can be compliers, in the sense that they *would have complied* had they been assigned to treatment.  Because users in the control group can never receive the treatment, noncompliance can only happen in the treatment group, which is what makes it one-sided.

In [10]:
noncompliance_data = 
  bind_cols(
      tibble(
          compliance = rbinom(1000, 1, 0.5)
  ), data) %>%
  mutate(y_observed = y_under_control + treatment_effect*treatment_status*compliance)

In [11]:
noncompliance_data %>% head()

compliance,treatment_status,treatment_effect,y_under_control,y_under_treatment,y_observed
<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
0,0,1.917901,9.398107,11.31601,9.398107
1,1,1.969274,9.006301,10.97558,10.975576
1,0,1.90979,11.026785,12.93658,11.026785
1,1,2.062707,10.751061,12.81377,12.813768
0,1,2.112036,8.490833,10.60287,8.490833
0,0,2.212721,9.904853,12.11757,9.904853


Now, the situation is different.  Each individual observation still has a theoretical value of y under both treatment and control.  And we assign 50% of our sample to treatment, and 50% to control, as designated by the treatment_status column.  But, we also have compliance.  This column represents a participant's willingness to comply with treatment if assigned to the treatment group, and is independent of treatment assignment (this last point is important, but is perfectly reasonable because treatment assignment was random; it is therefore independent of all background variables).  In this example, compliance is independent of the outcome variable as well.  This assumption is likely untenable in real practice, but let's get back to this point further down... 

Ok, so we have a sample that we have divided into treatment and control, and we know that the true treatment effect is 2 units, give or take.  But now we have a situation where only 50% of participants in the sample will actually take the treatment if assigned to treatment, which means that roughly half of our treatment group is actually untreated.  In other words, only 25% of the entire sample is actually exposed to the treatment, and the remaining 75% of the sample remains untreated.  

What happens when we calculate the ATE?

In [12]:
noncompliance_data %>%
  group_by(treatment_status) %>%
  summarise(y_observed = mean(y_observed))

treatment_status,y_observed
<int>,<dbl>
0,10.01029
1,10.9721


The observed treatment effect is now only 1 unit!  What happened?  For those that were actually treated, i.e. those in the treatment group that were also compliers, the treatment effect was realized.  But for those in the treatment group that did not comply and therefore did not receive the treatment, no treatment effect occurred.  What we have now is something called the *Intent to Treat Effect* (ITT); we have not really estimated the ATE because not all individuals that we intended to treat actually received the treatment.  But we intended to treat them all, hence the terminology.  

Now that we have noncompliance, we in fact have four groups of participants in our study: Those in the control group that would not have complied if they were treated, those in the control group that would have complied if treated, those in the treatment group that did not comply, and those in the treatment group that complied.  Only the latter actually received treatment:

In [13]:
noncompliance_data %>%
  select(treatment_status, compliance) %>%
  mutate_all(as.factor) %>%
  table() / 1000

                compliance
treatment_status     0     1
               0 0.251 0.256
               1 0.256 0.237

So, what if we are only interested in the treatment effect of those that complied?  This is called the Complier Average Causal Effect (also known as the Local Average Treatment Effect).  How do we recover this?  Simple, we take the ITT and divide it by the proportion of the treatment group that complied!  

<h3><center>$ CACE = \frac{ITT} {Pr_{compliers}}$ </center></h3>
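
The reason this works can be sketched under one assumption, namely that being assigned to treatment changes the outcome only for users who actually receive the treatment (often called the exclusion restriction).  The ITT is then a weighted average of the effect among compliers and a zero effect among noncompliers:

<h3><center>$ ITT = Pr_{compliers} \times CACE + (1 - Pr_{compliers}) \times 0 $</center></h3>

Dividing both sides by $Pr_{compliers}$ gives the formula above.  With the simulated numbers, an ITT of roughly 1 and a compliance rate of roughly 0.5 imply a CACE of roughly 2.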

In [14]:
treated_compliers = noncompliance_data %>% filter(compliance==1 & treatment_status==1) %>% nrow()
total_treated = noncompliance_data %>% filter(treatment_status==1) %>% nrow()
proportion_compliers = treated_compliers / total_treated

observed_outcomes = noncompliance_data %>%
  group_by(treatment_status) %>%
  summarise(y_observed = mean(y_observed)) %>%
  pull(y_observed)

diff(observed_outcomes) / proportion_compliers

And we get ~2! Exactly what we defined as our true treatment effect!


But, how do we calculate a standard error?  This depends on the nature of the noncompliance.  If noncompliance is random, such that the compliers and noncompliers were randomly, or as-if randomly, assigned, we can simply filter away the noncompliers from the treatment group and proceed as we would with any other AB test.  We can illustrate this by looking at the potential outcomes:

In [15]:
noncompliance_data %>%
  group_by(treatment_status, compliance) %>%
  summarise(y_control = mean(y_under_control),
           y_treatment = mean(y_under_treatment))

treatment_status,compliance,y_control,y_treatment
<int>,<int>,<dbl>,<dbl>
0,0,9.988671,11.98857
0,1,10.031478,12.03673
1,0,9.994911,11.99448
1,1,10.033843,12.02763


Because noncompliance was randomly assigned, the compliers have effectively the same potential outcomes as the noncompliers.  This means that we can filter out the noncompliers from the treatment group, and proceed with calculating the treatment effect and standard errors as we would in any AB test.  

In [16]:
treated_complier_mean = noncompliance_data %>% filter(treatment_status==1 & compliance==1) %>% summarise(y = mean(y_observed)) %>% pull(y)
control_mean = noncompliance_data %>% filter(treatment_status==0) %>% summarise(y = mean(y_observed)) %>% pull(y)

treated_complier_mean - control_mean

(Random variation in the y_observed in the treated noncomplier group means that this estimate is not identical to the CACE calculated above)
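
As a rough sketch of what "proceed as we would in any AB test" could look like when compliance really is random, the block below re-simulates the random-compliance setup (with its own seed and short variable names, both choices made for this sketch) and computes the difference in means with an unpooled standard error, alongside the Welch confidence interval from ```t.test```.

```r
set.seed(11)                   # arbitrary seed, chosen for this sketch
n = 1000
z = rbinom(n, 1, 0.5)          # treatment assignment
complier = rbinom(n, 1, 0.5)   # compliance drawn independently of everything else
y0 = rnorm(n, 10, 1)           # untreated potential outcome
tau = rnorm(n, 2, 0.1)         # individual treatment effects
y = y0 + tau * z * complier    # only treated compliers receive the effect

# Drop the noncompliers from the treatment group, keep the whole control group
treated_compliers = y[z == 1 & complier == 1]
control = y[z == 0]

estimate = mean(treated_compliers) - mean(control)
std_error = sqrt(var(treated_compliers) / length(treated_compliers) +
                 var(control) / length(control))
welch = t.test(treated_compliers, control)  # Welch two-sample t-test (R's default)

c(estimate = estimate, se = std_error)
welch$conf.int
```

This shortcut is only defensible because ```complier``` is drawn independently of ```y0``` here.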

But, barring a mistake in the group assignment process, random noncompliance is highly unlikely in real-world applications -- noncompliance will almost never be random.  And when it is not, filtering out noncompliers from the treatment group can greatly bias the estimated causal effect.

### So, what if noncompliance is not random?

Let's think back to our webpage example... We're testing a feature that is built on some deeper section of the page, and we know that only a limited proportion of the users that we assign to treatment will ever navigate that far.  Noncompliance in this case is likely defined by a user's engagement -- more engaged page users are more likely to navigate further into the page, and therefore more likely to run into our experimental feature.  Noncompliance in this case is far from random. So how do we deal with this? 

In [17]:
treatment_status = rbinom(1000, 1, 0.5)
treatment_effect = rnorm(1000, 2, 0.1)
y_under_control = rnorm(1000, 10, 1)
y_under_treatment = y_under_control + treatment_effect

# now, rather than random, compliance is determined by the baseline level of y
q75 = quantile(y_under_control, probs=0.75)
compliance = ifelse(y_under_control >= q75, 1, 0)
y_observed = y_under_control + treatment_effect*treatment_status*compliance
treated_complier = treatment_status*compliance

nr_noncompliance_data = tibble(
    compliance,
    treated_complier,
    treatment_status,
    treatment_effect,
    y_under_control,
    y_under_treatment,
    y_observed
)

In [18]:
nr_noncompliance_data %>% head()

compliance,treated_complier,treatment_status,treatment_effect,y_under_control,y_under_treatment,y_observed
<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
1,0,0,2.147833,10.834371,12.9822,10.834371
0,0,0,1.859321,9.301596,11.16092,9.301596
1,0,0,1.811603,11.30924,13.12084,11.30924
0,0,1,1.972263,9.019822,10.99209,9.019822
1,0,0,2.043043,10.747985,12.79103,10.747985
1,1,1,1.987121,11.257797,13.24492,13.244918


Now we've defined compliance to be related to the baseline value of the outcome variable y -- those in the 75th percentile or higher on the baseline value of y are compliers, all others are noncompliers.  Using our webpage example, if we imagine that y is a measure of page engagement, defining compliance in this manner is analogous to saying that only those that are highly engaged will see our experimental feature.  This is of course a simplified example, but it makes intuitive sense; if the feature is deep in the webpage somewhere, we'd only expect highly engaged users to spend enough time on the page to come across it.  Naturally, in real-world cases the determinants of compliance will be much more complex, but this simple example is sufficient to demonstrate the issue with noncompliance.

Given our updated data, let's have a look at the potential outcomes:



In [19]:
nr_noncompliance_data %>%
  group_by(treatment_status, compliance) %>%
  summarise(y_control = mean(y_under_control),
           y_treatment = mean(y_under_treatment))

treatment_status,compliance,y_control,y_treatment
<int>,<dbl>,<dbl>,<dbl>
0,0,9.579083,11.58035
0,1,11.247751,13.25187
1,0,9.59053,11.58303
1,1,11.246198,13.24286


Here it is obvious that we have issues: the potential outcomes for compliers are wildly different from the potential outcomes for noncompliers... We can see what would happen if we filtered out noncompliers from the treatment group like we did in the case of random noncompliance:

In [20]:
treated_compliers = nr_noncompliance_data %>% filter(treatment_status==1 & compliance==1) %>% summarise(y=mean(y_observed)) %>% pull(y)
control = nr_noncompliance_data %>% filter(treatment_status==0) %>% summarise(y=mean(y_observed)) %>% pull(y)

treated_compliers - control

Way over our true treatment effect of 2!  This is because we are capturing both the treatment effect, plus the increased baseline value for the compliers in the treatment group.  Remember, only those with high baseline values of y were compliers.  But the CACE is still valid!  This is because in calculating the CACE we do not throw out any observations, it is based on the difference in means of the __entire group__.

In [21]:
ITT = nr_noncompliance_data %>% group_by(treatment_status) %>% summarise(y = mean(y_observed)) %>% pull(y) %>% diff()
CACE = ITT / (sum(nr_noncompliance_data$treated_complier) / nrow(nr_noncompliance_data %>% filter(treatment_status == 1)))

In [22]:
CACE
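
To see why the filtered comparison overshoots while the CACE does not, the filtered estimate can be decomposed in expectation (using $y(1)$ and $y(0)$ for the treated and untreated potential outcomes, notation introduced for this aside):

<h3><center>$ E[y(1) \mid complier] - E[y(0)] = CACE + \left( E[y(0) \mid complier] - E[y(0)] \right) $</center></h3>

The term in parentheses is the baseline gap between compliers and the sample as a whole.  It is zero only when compliers and noncompliers share the same baseline.  In the simulated data it is far from zero: the complier rows of the potential outcomes table have a mean ```y_control``` of about 11.25, against roughly 10 for the sample as a whole.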

### But the standard errors?

In calculating the standard errors there are two sources of variance that need to be accounted for -- one from the estimated treatment effect, and one from the estimated compliance.  The process can be visualized with a DAG:

Treatment status --> Treatment uptake --> outcome variable

Treatment status affects the likelihood that one receives treatment, represented by Treatment uptake, which in turn affects the value of the outcome variable.  

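The answer to this problem is instrumental variables regression, with treatment assignment (```treatment_status```) serving as the instrument for treatment uptake (```treated_complier```):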

In [23]:
suppressWarnings(library(AER))

iv_fit = ivreg(y_observed ~ treated_complier | treatment_status, data=nr_noncompliance_data)

In [24]:
summary(iv_fit)


Call:
ivreg(formula = y_observed ~ treated_complier | treatment_status, 
    data = nr_noncompliance_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.19680 -0.64871 -0.01804  0.63024  3.36233 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      10.00761    0.04378 228.585  < 2e-16 ***
treated_complier  1.93451    0.25865   7.479 1.64e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9926 on 998 degrees of freedom
Multiple R-Squared:   0.5,	Adjusted R-squared: 0.4995 
Wald test: 55.94 on 1 and 998 DF,  p-value: 1.636e-13 


### How to get these estimates 'by hand'

If the data you work with fits easily in memory then calling ```ivreg``` (or ```IV2SLS``` from statsmodels in Python) is all you need to do; specify your variables in the function and interpret the output.  But, if you work with big data, you will run into problems if you can't fit all the data in memory.  Analyzing a random subsample will still give you a valid estimate of the CACE, but the standard errors will be larger than they need to be.  You can of course spin up a large VM that has sufficient memory, but I'm going to show how you can obtain these estimates 'by hand' using statistics you can obtain with some simple SQL queries.

We've already seen how to get the CACE: it's just the difference in means between treatment and control divided by the proportion of the treatment group that complied.  The standard error of the CACE is a bit more involved.  We take the variance of the residuals, divide it by the portion of the total variation in X that is explained by the instrument, and take the square root:

<h3><center>$ se_{CACE} = \sqrt{\frac{\hat{\sigma}^2} {SST_x R^2_{x,z}} }$ </center></h3>

This means that we need:
*  Predicted values for treatment and control (noncompliers in treatment receive the same predicted value as the control group)
*  Residuals for each observation based on these predicted values
*  The total sum of squares for X, our treatment uptake variable (```treated_complier```)
*  The squared correlation coefficient between X and Z, our treatment uptake and treatment assignment variables

All of this should be easy to obtain with a SQL query, which means that we can now estimate an instrumental variables regression on big data without the need for a large VM!

In [25]:
y_pred_control = nr_noncompliance_data %>% filter(treatment_status == 0) %>% pull(y_observed) %>% mean()
y_pred_treat_comply = y_pred_control + CACE

x = nr_noncompliance_data$treated_complier
z = nr_noncompliance_data$treatment_status
y = nr_noncompliance_data$y_observed

sst_x = sum((x - mean(x))^2)
r2_xz = cor(nr_noncompliance_data$treated_complier, nr_noncompliance_data$treatment_status)^2
sigma2 = nr_noncompliance_data %>%
  mutate(y_prediction = case_when(treated_complier==0 ~ y_pred_control,
                                  treated_complier==1 ~ y_pred_treat_comply)) %>%
  mutate(errors_squared = (y_observed - y_prediction)^2) %>%
  summarise(sigma_2 = sum(errors_squared) / (n()-2)) %>%
  pull(sigma_2)

The above code gives all of the component parts of the formula so now we just have to plug them in to get the estimated standard error:

In [26]:
iv_se = sqrt(sigma2/(sst_x*r2_xz))
iv_se

Which is the same as the standard error of the complier average causal effect estimate obtained with the ```ivreg()``` function:

In [27]:
iv_fit %>% 
  tidy() %>% 
  filter(term=='treated_complier') %>% 
  pull(std.error)
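
To make the SQL angle concrete, here is a sketch of the same calculation using nothing but per-group aggregates: a count, a sum of y and a sum of y squared for each combination of assignment and uptake, which is what a ```GROUP BY``` query would return.  The data are re-simulated with an arbitrary seed, and the names ```cells```, ```ssr_cell``` and the column names are hypothetical choices for this sketch.  The closed forms for ```sst_x``` and ```r2_xz``` assume binary variables and one-sided noncompliance.

```r
set.seed(2024)                 # arbitrary seed, chosen for this sketch
n = 1000
z = rbinom(n, 1, 0.5)          # treatment assignment
y0 = rnorm(n, 10, 1)           # untreated potential outcome
tau = rnorm(n, 2, 0.1)         # individual treatment effects
complier = as.integer(y0 >= quantile(y0, 0.75))
x = z * complier               # treatment uptake
y = y0 + tau * x               # observed outcome

# Stand-in for the SQL query: count, sum(y), sum(y^2) per (z, x) cell
d = data.frame(z, x, n = 1, sum_y = y, sum_y2 = y^2)
cells = aggregate(cbind(n, sum_y, sum_y2) ~ z + x, data = d, FUN = sum)

ctrl = cells[cells$z == 0, ]                     # control group
t_non = cells[cells$z == 1 & cells$x == 0, ]     # treated noncompliers
t_comp = cells[cells$z == 1 & cells$x == 1, ]    # treated compliers

n_total = sum(cells$n)
n_treat = t_non$n + t_comp$n
mean_control = ctrl$sum_y / ctrl$n
mean_treat = (t_non$sum_y + t_comp$sum_y) / n_treat
cace = (mean_treat - mean_control) / (t_comp$n / n_treat)

# sum((y - m)^2) expanded so that it only needs the aggregates
ssr_cell = function(cell, m) cell$sum_y2 - 2 * m * cell$sum_y + cell$n * m^2
ssr = ssr_cell(ctrl, mean_control) + ssr_cell(t_non, mean_control) +
  ssr_cell(t_comp, mean_control + cace)
sigma2 = ssr / (n_total - 2)

# Closed forms for binary x and z when x can only be 1 if z is 1
x_bar = t_comp$n / n_total
z_bar = n_treat / n_total
sst_x = n_total * x_bar * (1 - x_bar)
r2_xz = x_bar * (1 - z_bar) / ((1 - x_bar) * z_bar)

se_cace = sqrt(sigma2 / (sst_x * r2_xz))
c(cace = cace, se = se_cace)
```

Only three rows of aggregates ever need to leave the database, so the calculation does not depend on how many users are in the experiment.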

### Last thoughts, what does this mean for generalizability?

So when we have noncompliance we can obtain the complier average causal effect by scaling the intent to treat effect by the proportion of the treated that complied.  But does this mean that our results cannot be generalized to the wider population? Of course, that's exactly what this means!  With systematic noncompliance, the complier average causal effect cannot be generalized to noncompliers as well because these two groups have different potential outcomes.  Noncompliers were never treated, so in effect no experiment was run with their involvement... And you can never generalize to a group that was unrepresented in any experiment.  The inclusion of noncompliers in the sample doesn't change that.  Sometimes, such as the new feature on a deep section of a webpage, that is exactly what you want -- you only want an estimate of the effect of the feature on users that will actually see said feature. And sometimes the CACE is not exactly what you want, but is the best you can get.  In either case, be aware that with noncompliance your results only generalize to compliers.  