# CUPED questions

- Goal: understand several aspects of CUPED.

- How does it compare to regression adjustment?

- Should the adjustment add back the mean or not?

In [1]:
from decimal import Decimal

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

import src.helpers as hp

%load_ext autoreload
%autoreload 2


Load example data

In [2]:
fp = '/Users/fabian.gunzinger/tmp/cuped_example.parquet'
df = pd.read_parquet(fp)
df.head()

              user_id  order_price  order_price_pre  t
319393  JE:IE:1000007    23.146154        21.619444  0
319399  JE:IE:1000063    13.440000        21.250000  1
319400  JE:IE:1000064    19.245001        14.700000  0
319401  JE:IE:1000071    22.625000        27.049999  0
319402  JE:IE:1000076    20.034000        20.834167  1


Simple regression model

In [3]:
formula = 'order_price ~ t'
res_simpreg = smf.ols(formula, data=df).fit()

t_simpreg = res_simpreg.params['t']
int_simpreg = res_simpreg.params['Intercept']

print(res_simpreg.summary().tables[1])

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     23.6222      0.024    972.818      0.000      23.575      23.670
t              1.1629      0.034     33.865      0.000       1.096       1.230


Regression adjustment

In [4]:
formula = 'order_price ~ t + order_price_pre'
res_multireg = smf.ols(formula, data=df).fit()

t_regadjust = res_multireg.params.t
int_regadjust = res_multireg.params.Intercept

print(res_multireg.summary().tables[1])
print(t_simpreg)
print(t_regadjust)

                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           9.5653      0.042    228.211      0.000       9.483       9.647
t                   1.1693      0.029     40.341      0.000       1.113       1.226
order_price_pre     0.6033      0.002    384.477      0.000       0.600       0.606
1.1629330328015293
1.1693386503812886


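The two estimates of the t coefficient are almost identical but not quite. They would be exactly identical if t were perfectly uncorrelated with order_price_pre. In theory it is, because treatment is randomly assigned, but in a finite sample there is a small correlation.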

In [5]:
df[['order_price_pre', 't']].corr()

                 order_price_pre         t
order_price_pre         1.000000 -0.000575
t                      -0.000575  1.000000
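
The size of the gap between the two t coefficients can be accounted for with the omitted-variable identity: the simple-regression coefficient equals the adjusted coefficient plus the coefficient on the covariate times the slope from regressing the covariate on t. The sketch below checks this on synthetic data. The variables `x`, `t`, `y` are hypothetical stand-ins for `order_price_pre`, `t` and `order_price`, and `ols` is an illustrative helper, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic stand-ins (assumed data-generating process, not the real data)
x = rng.normal(20, 5, n)                      # pre-experiment covariate
t = rng.integers(0, 2, n).astype(float)       # random treatment assignment
y = 10 + 1.2 * t + 0.6 * x + rng.normal(0, 3, n)


def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_simple = ols(y, t)[1]              # y ~ t
_, b_adjust, delta = ols(y, t, x)    # y ~ t + x
gamma = ols(x, t)[1]                 # x ~ t: chance imbalance of x across arms

# The two treatment coefficients differ by exactly delta * gamma in-sample
gap = b_simple - b_adjust
print(gap, delta * gamma)
```

If `gamma` were exactly zero (perfect balance of the covariate across arms), the two coefficients would coincide.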


Verifying FWL

- The t coefficient is identical to the one from the multiple regression model, just as FWL predicts. (The two estimates differ only from the 12th decimal place onwards, due to floating point precision limits.)

In [58]:
df['order_price_res'] = smf.ols('order_price ~ order_price_pre', data=df).fit().resid
df['t_res'] = smf.ols('t ~ order_price_pre', data=df).fit().resid

formula = 'order_price_res ~ t_res'
res_fwl = smf.ols(formula, data=df).fit()

t_fwl = res_fwl.params.t_res
int_fwl = res_fwl.params.Intercept

print(res_fwl.summary().tables[1])
print(t_regadjust)
print(t_fwl)

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   8.482e-14      0.014   5.85e-12      1.000      -0.028       0.028
t_res          1.1693      0.029     40.341      0.000       1.113       1.226
1.1693386503812886
1.1693386503848742
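
Since the example data is read from a local file, a self-contained check of the same FWL steps on synthetic data may be useful. This is a minimal sketch: `x`, `t`, `y` are hypothetical stand-ins for the notebook's columns, and `fit` is an illustrative helper built on `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Synthetic stand-ins (assumed data-generating process, not the real data)
x = rng.normal(20, 5, n)
t = rng.integers(0, 2, n).astype(float)
y = 10 + 1.2 * t + 0.6 * x + rng.normal(0, 3, n)


def fit(y, *regressors):
    """OLS with intercept: return (coefficients, residuals)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef


coef_full, _ = fit(y, t, x)      # y ~ t + x
_, y_res = fit(y, x)             # residualise y on x
_, t_res = fit(t, x)             # residualise t on x
coef_fwl, _ = fit(y_res, t_res)  # y_res ~ t_res

# Same slope on t; the FWL intercept is zero because both residuals have mean zero
print(coef_full[1], coef_fwl[1], coef_fwl[0])
```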


CUPED

In [85]:
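def cuped_adjust_with_mean(df, y, x):
    # Estimate theta = cov(y, x) / var(x) on rows where both columns are observed
    data = df.dropna(subset=[y, x])
    cv = np.cov([data[y], data[x]])
    theta = cv[0, 1] / cv[1, 1]
    y, x = data[y], data[x]
    # Subtract theta times the demeaned covariate, which leaves the mean of y unchanged.
    # Note: rows with missing values were dropped above, so fillna(y) has nothing to fill.
    return (y - (x - x.mean()) * theta).fillna(y)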


df['order_price_cuped_with_mean'] = cuped_adjust_with_mean(df, 'order_price', 'order_price_pre')

formula = 'order_price_cuped_with_mean ~ t'
res_cuped_with_mean = smf.ols(formula, data=df).fit()

t_cuped_with_mean = res_cuped_with_mean.params.t
int_cuped_with_mean = res_cuped_with_mean.params.Intercept

print(res_cuped_with_mean.summary().tables[1])
print(t_regadjust)
print(t_cuped_with_mean)

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     23.6190      0.020   1152.333      0.000      23.579      23.659
t              1.1693      0.029     40.341      0.000       1.113       1.226
1.1693386503812886
1.1693382642527503


Notice that while regression adjustment residualises both t and y, CUPED residualises only y. Let's check whether CUPED is even closer to a regression in which only y is residualised. It is, as expected!

In [60]:
formula = 'order_price_res ~ t'
res_yresid = smf.ols(formula, data=df).fit()

t_yresid = res_yresid.params.t
int_yresid = res_yresid.params.Intercept

print(res_yresid.summary().tables[1])
print(t_cuped_with_mean)
print(t_yresid)



                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.5847      0.020    -28.525      0.000      -0.625      -0.544
t              1.1693      0.029     40.341      0.000       1.113       1.226
1.1693382642527503
1.169338264079783
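
The size of the gap between the two approaches can be made precise. Because the residualised y is orthogonal to x, regressing it on raw t instead of residualised t scales the coefficient by (1 - r²), where r is the sample correlation between t and x. With r = -0.000575 and a coefficient of about 1.1693, this predicts a gap of roughly 3.9e-7, in line with the difference between 1.16933865 and 1.16933826 above. The sketch below checks the relation on synthetic data; all variable names and the helper `slope_and_resid` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000

# Synthetic stand-ins (assumed data-generating process, not the real data)
x = rng.normal(20, 5, n)
t = rng.integers(0, 2, n).astype(float)
y = 10 + 1.2 * t + 0.6 * x + rng.normal(0, 3, n)


def slope_and_resid(y, x):
    """Simple regression of y on x (with intercept): return slope and residuals."""
    slope = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    resid = (y - y.mean()) - slope * (x - x.mean())
    return slope, resid


theta, y_res = slope_and_resid(y, x)
_, t_res = slope_and_resid(t, x)

b_fwl, _ = slope_and_resid(y_res, t_res)  # both y and t residualised
b_yres, _ = slope_and_resid(y_res, t)     # only y residualised

# CUPED with the mean added back is y_res + mean(y), so its slope matches b_yres
y_cuped = y - theta * (x - x.mean())
b_cuped, _ = slope_and_resid(y_cuped, t)

r = np.corrcoef(t, x)[0, 1]
print(b_fwl - b_yres, b_fwl * r**2)
print(b_cuped, b_yres)
```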


FWL with added mean

- In FWL we lose the intercept. Let's add the means of y and t back to the residuals to recover it.

In [None]:
df['order_price_res_with_mean'] = (
    smf.ols('order_price ~ order_price_pre', data=df).fit().resid + df.order_price.mean()
)
df['t_res_with_mean'] = smf.ols('t ~ order_price_pre', data=df).fit().resid + df.t.mean()

formula = 'order_price_res_with_mean ~ t_res_with_mean'
res_fwl_with_mean = smf.ols(formula, data=df).fit()
print(res_fwl_with_mean.summary().tables[1])

t_fwl_with_mean = res_fwl_with_mean.params.t_res_with_mean
int_fwl_with_mean = res_fwl_with_mean.params.Intercept

print(t_fwl_with_mean)
print(t_fwl)



                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          23.6190      0.020   1152.333      0.000      23.579      23.659
t_res_with_mean     1.1693      0.029     40.341      0.000       1.113       1.226
1.169338650384866
1.1693386503848742


CUPED with no mean adjustment

In [None]:
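def cuped_adjust_no_mean(df, y, x):
    # Estimate theta = cov(y, x) / var(x) on rows where both columns are observed
    data = df.dropna(subset=[y, x])
    cv = np.cov([data[y], data[x]])
    theta = cv[0, 1] / cv[1, 1]
    y, x = data[y], data[x]
    # Subtract theta * x without demeaning x, which shifts the mean of y by -theta * mean(x).
    # Note: rows with missing values were dropped above, so fillna(y) has nothing to fill.
    return (y - x * theta).fillna(y)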


df['order_price_cuped_no_mean'] = cuped_adjust_no_mean(df, 'order_price', 'order_price_pre')

formula = 'order_price_cuped_no_mean ~ t'
res_cuped_no_mean = smf.ols(formula, data=df).fit()
print(res_cuped_no_mean.summary().tables[1])

t_cuped_no_mean = res_cuped_no_mean.params.t
int_cuped_no_mean = res_cuped_no_mean.params.Intercept

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      9.5662      0.020    466.720      0.000       9.526       9.606
t              1.1693      0.029     40.341      0.000       1.113       1.226


### Understanding differences in intercept

In [104]:
print("{:<20}: {:<20.15}: {:<20.15}".format("regadjust", int_regadjust, t_regadjust))
print("{:<20}: {:<20.15}: {:<20.15}".format("fwl", int_fwl, t_fwl))
print("{:<20}: {:<20.15}: {:<20.15}".format("fwl_with_mean", int_fwl_with_mean, t_fwl_with_mean))
print()
print("{:<20}: {:<20.15}: {:<20.15}".format("cuped_no_mean", int_cuped_no_mean, t_cuped_no_mean))
print("{:<20}: {:<20.15}: {:<20.15}".format("yresid", int_yresid, t_yresid))
print(
    "{:<20}: {:<20.15}: {:<20.15}".format("cuped_with_mean", int_cuped_with_mean, t_cuped_with_mean)
)

regadjust           : 9.56533868102738    : 1.16933865038129    
fwl                 : 8.48211738659945e-14: 1.16933865038487    
fwl_with_mean       : 23.6189530776169    : 1.16933865038487    

cuped_no_mean       : 9.56618603498113    : 1.16933826494191    
yresid              : -0.5846691320398    : 1.16933826407978    
cuped_with_mean     : 23.6189535772187    : 1.16933826425275    


What do we learn from all of this?

- Once we move past the fact that all results are identical for all practical purposes, we can separate them into two blocks, as in the printout above. The first block contains the regression adjustment results, the second block the CUPED results.

- Notice how the t coefficients start to differ between the two blocks from the 7th decimal onwards, and within the blocks from the 12th (first block) and 10th (second block) decimal onwards. Notice, too, that in each block the intercept values are roughly 9.6, close to 0 (zero up to floating point in the first block, -0.58 in the second), and 23.6.

- Let's focus on the t coefficient, first.

- FWL holds: regression adjustment and residualised regression are identical except for slight differences due to floating point imprecision (FWL is a mathematically proven theorem, so of course it holds!)

- Adding mean values to the residualised variables has no impact on the FWL coefficient on t (the two estimates agree to 13 decimal places); only the intercept changes. More on this below.

- Why do the CUPED results differ by more than what we can ascribe to floating point imprecision?

- Because in regression adjustment we residualise both y and t, while in CUPED we residualise y only. In theory there is no need to residualise t, because it is randomly assigned and thus uncorrelated with x. In practice, there can be a very small sample correlation (-0.000575 here, as seen above), and this is what accounts for the difference.

- We can verify this logic by comparing CUPED to a version of the FWL regression where we residualise only y but not t. Sure enough, this is virtually identical to CUPED except for floating point imprecision.

- Now, let's focus on the intercept terms. 

- The intercept of the regression adjustment equation is the predicted mean of y when both t and x are zero.

- The intercept of the FWL equation has the same interpretation, but in the process of residualising y and t we have effectively demeaned them, since the residuals of an OLS regression with an intercept have mean zero. Hence the intercept is zero.

- We can add the mean back by explicitly adding it to the residualised values. Then the intercept is 23.6, which is the predicted mean of y if t = 0 (with x at its sample mean).

- How do the two slightly different versions of CUPED fit in here? In practice, you see two ways to perform the CUPED adjustment: one that adds back the mean of x, `y - theta * (x - mean_x)`, and one that doesn't, `y - theta * x`. The one that doesn't add the mean is correct in the sense that it is the one that follows from the math, and it produces an intercept essentially equal to that of regression adjustment (9.566 versus 9.565). Adding the mean produces an intercept equal to the mean of y if t = 0.

- Why? Let's compare the adjustments:

$$
\begin{align*}
\tilde{y}_i^{\text{CUPED}} = y_i - \theta(x_i - \bar{x}) &= y_i - (\hat{\alpha} + \hat{\delta} x_i) + \bar{y} = \tilde{y}_i^{\text{FWL}} \\
y_i - \theta(x_i - \bar{x}) &= y_i - (\hat{\alpha} + \theta x_i) + \bar{y} && \text{since } \theta = \hat{\delta} \\
y_i - \theta x_i + \theta \bar{x} &= y_i - \hat{\alpha} - \theta x_i + \bar{y} && \text{expanding both sides} \\
\theta \bar{x} &= \bar{y} - \hat{\alpha} && \text{cancelling terms (the crucial step)} \\
\hat{\alpha} &= \bar{y} - \theta \bar{x} && \text{true: this is the intercept formula from simple regression}
\end{align*}
$$

- So, what is going on here? It turns out that adding $\theta \bar{x}$ to the CUPED-adjusted value is exactly the same as adding $\bar{y}$ to the residualised value. Hence, doing so is simply a way to recover the intercept.

- This answers a question I had earlier: how come CUPED is identical to regression adjustment even though we don't demean values in CUPED? Well, notice first that demeaning only affects the intercept, not the coefficient on t, so it's not surprising that the coefficients would be identical. Why is the intercept of the no-mean adjustment identical to that of regression adjustment? TBD some other time; it is the last question I have to answer.

- If we want to use the sample mean for reporting, then we should use the version that adds the mean back.
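
The identity derived above can also be checked numerically. The following is a minimal sketch on synthetic data (the variables `x`, `t`, `y` and the helper `intercept_and_slope` are hypothetical, not taken from the notebook). It confirms that the with-mean CUPED outcome equals the residual of y on x plus the mean of y, and that the two CUPED variants share the same t coefficient while their intercepts differ by theta times the mean of x. In the printout above, the two CUPED intercepts differ by about 14.05, which by this identity should equal theta times the mean of order_price_pre.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000

# Synthetic stand-ins (assumed data-generating process, not the real data)
x = rng.normal(20, 5, n)
t = rng.integers(0, 2, n).astype(float)
y = 10 + 1.2 * t + 0.6 * x + rng.normal(0, 3, n)

theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - theta * x.mean()      # intercept of the regression y ~ x

y_res = y - (alpha_hat + theta * x)          # residual from y ~ x
cuped_with_mean = y - theta * (x - x.mean())
cuped_no_mean = y - theta * x


def intercept_and_slope(y, t):
    """Simple regression of y on t: return (intercept, slope)."""
    slope = np.cov(y, t)[0, 1] / np.var(t, ddof=1)
    return y.mean() - slope * t.mean(), slope


int_with, b_with = intercept_and_slope(cuped_with_mean, t)
int_no, b_no = intercept_and_slope(cuped_no_mean, t)

print(np.allclose(cuped_with_mean, y_res + y.mean()))
print(b_with, b_no)
print(int_with - int_no, theta * x.mean())
```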


