In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm



# Exercise 1: Game Fun

In [2]:
df_game = pd.read_excel("GameFun.xlsx")

In [3]:
df_game.head()

Unnamed: 0,id,test,purchase,site,impressions,income,gender,gamer
0,1956,0,0,site1,0,100,1,0
1,45821,1,0,site1,20,70,1,0
2,59690,1,0,site1,22,100,1,0
3,18851,0,0,site1,13,90,1,0
4,60647,1,0,site1,12,60,1,0


In [4]:
df_game.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40048 entries, 0 to 40047
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           40048 non-null  int64 
 1   test         40048 non-null  int64 
 2   purchase     40048 non-null  int64 
 3   site         40048 non-null  object
 4   impressions  40048 non-null  int64 
 5   income       40048 non-null  int64 
 6   gender       40048 non-null  int64 
 7   gamer        40048 non-null  int64 
dtypes: int64(7), object(1)
memory usage: 2.4+ MB


**1. Before evaluating the effect of an experiment, it is important to make sure that the experiment was executed correctly. Check whether the test and control groups are probabilistically equivalent on their observables?**          
a. More specific, compare the averages of the income, gender and gamer variables in the test and control groups. You should also report the % difference in the averages. Compute its statistical significance. [2 pts]        
b. Briefly comment on what these metrics tell you about probabilistic equivalence for this experiment. [2 pts]              
c. If you had run this type of analysis BEFORE executing an experiment and found a large difference between test and control groups, what you should do? [5 pts]                     
d. (Open/Ended Question) If you had millions of consumers, your “classic” statistical significance tests would not work (this is because the number of samples is used to compute those classic statistical tests). Do some research online and propose what significance test would you do in case you had “big data”? [5 pts]       

In [5]:
# Compute the averages of income, gender and gamer in test and control, and the relative difference in these averages.
df_mean_diff = df_game.groupby('test')[['income','gender','gamer']].mean().T
df_mean_diff.head()
# Note: the difference is (control - test) / test, stored as a fraction (0.00415 means about 0.4%).
df_mean_diff['percent_diff_in_avg'] = np.round(((df_mean_diff[0] - df_mean_diff[1])/df_mean_diff[1]),5)
df_mean_diff.head()

test,0,1,percent_diff_in_avg
income,55.166012,54.938236,0.00415
gender,0.647905,0.647289,0.00095
gamer,0.601823,0.601331,0.00082


***The table above shows the average values of gender, gamer and income in the treatment and control groups, together with the relative difference in averages (expressed as a fraction, so 0.00415 is about 0.4%).***

In [6]:
# Checking for statistical significance of the gender difference
treatment_gender = df_game[df_game['test']==1]['gender']
control_gender = df_game[df_game['test']==0]['gender']

In [7]:
# Before applying the two-sample t-test, check whether the variances are equal. They are nearly equal (ratio close to 1), so the pooled-variance t-test is appropriate.
print("Variance of gender in treatment group:",  np.var(treatment_gender))
print("Variance of gender in control group:",  np.var(control_gender))
print("How many times is the larger variance than the smaller variance :",  np.var(treatment_gender) / np.var(control_gender))

Variance of gender in treatment group: 0.22830590118167943
Variance of gender in control group: 0.22812411307788086
How many times is the larger variance than the smaller variance : 1.0007968824572986


In [8]:
# t-test
stats.ttest_ind(a=treatment_gender, b=control_gender, equal_var=True)

Ttest_indResult(statistic=-0.11804408014871089, pvalue=0.906033323148871)

**The t-test shows that the difference in average gender between the two groups is not statistically significant (p ≈ 0.91), so we find no evidence of imbalance on gender.**

In [9]:
# Checking for statistical significance of the income difference
treatment_income = df_game[df_game['test']==1]['income']
control_income = df_game[df_game['test']==0]['income']

In [10]:
# Before applying the two-sample t-test, check whether the variances are equal. They are nearly equal (ratio close to 1), so the pooled-variance t-test is appropriate.
print("Variance of income in treatment group:",  np.var(treatment_income))
print("Variance of income in control group:",  np.var(control_income))
print("How many times is the larger variance than the smaller variance :",  np.var(treatment_income) / np.var(control_income))

Variance of income in treatment group: 188.74970062410344
Variance of income in control group: 186.80233060888244
How many times is the larger variance than the smaller variance : 1.0104247629506202


In [11]:
# t-test
stats.ttest_ind(a=treatment_income, b=control_income, equal_var=True)

Ttest_indResult(statistic=-1.520640253683462, pvalue=0.1283580345995143)

**The t-test shows that the difference in average income between the two groups is not statistically significant (p ≈ 0.13), so we find no evidence of imbalance on income.**

In [12]:
# Checking for statistical significance of the gamer difference
treatment_gamer = df_game[df_game['test']==1]['gamer']
control_gamer = df_game[df_game['test']==0]['gamer']

In [13]:
# Before applying the two-sample t-test, check whether the variances are equal. They are nearly equal (ratio close to 1), so the pooled-variance t-test is appropriate.
print("Variance of gamer in treatment group:",  np.var(treatment_gamer))
print("Variance of gamer in control group:",  np.var(control_gamer))
print("How many times is the larger variance than the smaller variance :",  np.var(treatment_gamer) / np.var(control_gamer))

Variance of gamer in treatment group: 0.2397319499525355
Variance of gamer in control group: 0.23963203598263988
How many times is the larger variance than the smaller variance : 1.0004169474648326


In [14]:
# t-test
stats.ttest_ind(a=treatment_gamer, b=control_gamer, equal_var=True)

Ttest_indResult(statistic=-0.09199349089131977, pvalue=0.9267036713286598)

**The t-test shows that the difference in the share of gamers between the two groups is not statistically significant (p ≈ 0.93), so we find no evidence of imbalance on gamer/non-gamer status.**
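
The three checks above repeat the same steps for each covariate. A minimal sketch of how they could be wrapped in one helper follows. It is an illustration under stated assumptions, not the notebook's method: it uses only the standard library and toy data rather than `GameFun.xlsx`, a Welch (unequal-variance) standard error instead of the pooled one used above, and a normal approximation for the p-value, which is only reasonable for large samples. The name `balance_check` and the toy lists are hypothetical.

```python
import math
from statistics import mean, variance


def balance_check(treated, control):
    """Compare one covariate between a treatment and a control sample."""
    m1, m0 = mean(treated), mean(control)
    # Welch standard error: does not assume equal variances in the two groups.
    se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    t_stat = (m1 - m0) / se
    # Two-sided p-value from the normal approximation (assumption: large n).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return {
        "mean_treated": m1,
        "mean_control": m0,
        "pct_diff": 100 * (m1 - m0) / m0,  # relative difference in percent
        "t_stat": t_stat,
        "p_value": p_value,
    }


# Toy 0/1 covariate (hypothetical values, far too few rows for real inference).
toy_treated = [1, 0, 1, 1, 0, 1, 0, 1]
toy_control = [1, 1, 0, 1, 0, 0, 1, 0]

result = balance_check(toy_treated, toy_control)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With real data, the same function could be called once per covariate (income, gender, gamer) on the test and control columns.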

**2. Evaluate the average purchase rates in the test and control for the following groups. For each comparison, report the average purchase rate for the test, average purchase rate for the control and the absolute difference (not the % difference) between the test and control.**              
a. Comparison 1: All customers [2 pts]               
b. Comparison 2: Male vs Female customers [2 pts]                 
c. Comparison 3: Gamers vs Non-Gamers Customers [2 pts]               
d. Comparison 4: Female Gamers vs Male Gamers [2 pts]                  


### Comparison 1

In [15]:
c1 = df_game.groupby('test')['purchase'].mean()
c1

test
0    0.036213
1    0.076822
Name: purchase, dtype: float64

In [16]:
print("Purchase Rate Absolute Difference between Treatment and Control for ALL Customers :", np.round(c1[1] - c1[0],4))

Purchase Rate Absolute Difference between Treatment and Control for ALL Customers : 0.0406


### Comparison 2

In [17]:
c2 = df_game.groupby(['test', 'gender'])['purchase'].mean()
c2.head()

test  gender
0     0         0.034442
      1         0.037176
1     0         0.080945
      1         0.074575
Name: purchase, dtype: float64

In [18]:
print("Purchase Rate Absolute Difference between Treatment and Control for Male Customers :", np.round(c2[1,1] - c2[0,1],4))
print("Purchase Rate Absolute Difference between Treatment and Control for Female Customers :", np.round(c2[1,0] - c2[0,0],4))

Purchase Rate Absolute Difference between Treatment and Control for Male Customers : 0.0374
Purchase Rate Absolute Difference between Treatment and Control for Female Customers : 0.0465


### Comparison 3

In [19]:
c3 = df_game.groupby(['test', 'gamer'])['purchase'].mean()
c3.head()

test  gamer
0     0        0.037387
      1        0.035436
1     0        0.035092
      1        0.104487
Name: purchase, dtype: float64

In [20]:
print("Purchase Rate Absolute Difference between Treatment and Control for Gamer Customers :", np.round(c3[1,1] - c3[0,1],4))
print("Purchase Rate Absolute Difference between Treatment and Control for Non-Gamer Customers :", np.round(c3[1,0] - c3[0,0],4))

Purchase Rate Absolute Difference between Treatment and Control for Gamer Customers : 0.0691
Purchase Rate Absolute Difference between Treatment and Control for Non-Gamer Customers : -0.0023


### Comparison 4

In [21]:
c4 = df_game[df_game['gamer'] == 1].groupby(['test', 'gender'])['purchase'].mean()
c4.head()

test  gender
0     0         0.032041
      1         0.037275
1     0         0.110092
      1         0.101404
Name: purchase, dtype: float64

In [22]:
print("Purchase Rate Absolute Difference between Treatment and Control for Male Gamers :", np.round(c4[1,1] - c4[0,1],4))
print("Purchase Rate Absolute Difference between Treatment and Control for Female Gamers :", np.round(c4[1,0] - c4[0,0],4))

Purchase Rate Absolute Difference between Treatment and Control for Male Gamers : 0.0641
Purchase Rate Absolute Difference between Treatment and Control for Female Gamers : 0.0781


***3. Assess the expected revenue in the test vs. control for the following comparisons:***                        
a. Comparison 1: All customers [4 pts]
b. Comparison 4: Female Gamers vs Male Gamers [4 pts]                 

**For the expected revenue calculation, note that:**
- the revenue inflow per subscription for someone in the treatment group = **12.5 dollars (37.5 minus the promotional value of 25).**
- the revenue per subscription for someone in the control group = **37.5 dollars (since the user receives no promotion).**

### Comparison 1: All customers

In [23]:
c5 = df_game.groupby('test')['purchase'].sum()
c5

test
0     433
1    2158
Name: purchase, dtype: int64

In [24]:
print("Expected Revenue in Control group: {:.2f}".format(37.5 * c5[0]))
print("Expected Revenue in Treatment group: {:.2f}".format(12.5 * c5[1]))
print("Increase in Revenue in Treatment group: {:.2f}".format((12.5 * c5[1]) - (37.5 * c5[0])))

Expected Revenue in Control group: 16237.50
Expected Revenue in Treatment group: 26975.00
Increase in Revenue in Treatment group: 10737.50


### Comparison 4: Female Gamers vs Male Gamers

In [25]:
c6 = df_game[df_game['gamer'] == 1].groupby(['test', 'gender'])['purchase'].sum()
c6.head()

test  gender
0     0           81
      1          174
1     0          660
      1         1105
Name: purchase, dtype: int64

In [26]:
print("Expected Revenue in Control Female Gamer group: {:.2f}".format(37.5 * c6[0,0]))
print("Expected Revenue in Treatment Female Gamer group: {:.2f}".format(12.5 * c6[1,0]))
print("Increase in Revenue in Treatment group: {:.2f}".format(((12.5 * c6[1,0]) - (37.5 * c6[0,0]))))

Expected Revenue in Control Female Gamer group: 3037.50
Expected Revenue in Treatment Female Gamer group: 8250.00
Increase in Revenue in Treatment group: 5212.50


In [27]:
print("Expected Revenue in Control Male Gamer group: {:.2f}".format(37.5 * c6[0,1]))
print("Expected Revenue in Treatment Male Gamer group: {:.2f}".format(12.5 * c6[1,1]))
print("Increase in Revenue in Treatment group: {:.2f}".format(((12.5 * c6[1,1]) - (37.5 * c6[0,1]))))

Expected Revenue in Control Male Gamer group: 6525.00
Expected Revenue in Treatment Male Gamer group: 13812.50
Increase in Revenue in Treatment group: 7287.50
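
The totals above depend on how many customers are in each group, and the purchase counts and purchase rates reported earlier imply that the control group (about 12,000 customers) is much smaller than the test group (about 28,000). One size-independent way to look at the same numbers, offered here only as a sketch, is expected revenue per customer: the purchase rate multiplied by the revenue per purchase. The rates are copied from Comparisons 1 and 4 above; the names `purchase_rates` and `revenue_per_customer` are hypothetical.

```python
# Revenue per purchase, as stated in the assumptions above.
REVENUE_CONTROL = 37.5  # full price, no promotion
REVENUE_TEST = 12.5     # 37.5 minus the promotional value of 25

# Purchase rates as (control, test), copied from Comparisons 1 and 4 above.
purchase_rates = {
    "all customers": (0.036213, 0.076822),
    "female gamers": (0.032041, 0.110092),
    "male gamers": (0.037275, 0.101404),
}


def revenue_per_customer(control_rate, test_rate):
    """Return expected revenue per customer as (control, test)."""
    return REVENUE_CONTROL * control_rate, REVENUE_TEST * test_rate


per_customer = {
    segment: revenue_per_customer(*rates)
    for segment, rates in purchase_rates.items()
}

for segment, (control_rev, test_rev) in per_customer.items():
    print(f"{segment}: control {control_rev:.3f}, test {test_rev:.3f}, "
          f"difference {test_rev - control_rev:+.3f}")
```

On these rates the per-customer figure comes out lower in the test group for all customers and for male gamers, and higher for female gamers, which is worth weighing alongside the totals.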


**4. Based on your previous answers, provide a brief recommendation to your management team summarizing the expected financial outcome for Game-Fun.**           
a. Should Game-Fun run this promotion again in the future? If no, explain why. If
yes, should Game-Fun offer it to all customers or a targeted segment. [10 pts]

- Total expected revenue is higher in the treatment group than in the control group, although the two totals come from groups of different sizes.
- However, there are trade-offs, as the promotional offer reduces the revenue inflow per purchase in the treatment group.
- Despite this, a larger number of paid subscribers should increase organic traffic and subscriptions, since the K-factor will increase.
- Considering this, we recommend that Game-Fun run such a promotion in the future.


Questions:      
1. For each treatment customer making a purchase the revenue is 12.5, but for each control customer making a purchase the revenue is 37.5. Is that understanding correct?
2. Revenue means the net inflow. Is that correct?

# Exercise 2: Non-Compliance in Randomized Experiments

In [28]:
df_sommer = pd.read_csv("sommer_deger.csv")
df_sommer.head()

Unnamed: 0,instrument,treatment,outcome
0,0,0,0
1,0,0,0
2,0,0,0
3,0,0,0
4,0,0,0


***1. The first data scientist advised that one should compare the survival rate of babies whose mothers were offered Vitamin A shots to the survival rate of babies whose mothers were not offered a Vitamin A shot.***         
a. What percent of babies whose mothers were offered Vitamin A shots for their babies died? [3 pts]                 
b. What percent of babies whose mothers were not offered Vitamin A shots for their babies died? [3 pts]           
c. What is the difference in mortality, and under what assumptions is the difference between these two percentages a valid estimate of the causal impact of receiving vitamin A shots on survival? [4 pts]                     
   

In [29]:
df_prop_diff1 = df_sommer.groupby('instrument')[['outcome']].mean().T
df_prop_diff1.head()

instrument,0,1
outcome,0.006386,0.003804


In [30]:
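# Offered-minus-not-offered difference in mortality, divided by 0.4123
# (a hard-coded constant; how it was obtained is not shown in this notebook).
(df_prop_diff1.iloc[0,1] - df_prop_diff1.iloc[0,0]) / 0.4123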

-0.0062633459140857845

- a. 0.3804% of babies whose mothers were offered Vitamin A shots for their babies died.
- b. 0.6386% of babies whose mothers were not offered Vitamin A shots for their babies died.
- c. The difference in mortality rate ('Offered' - 'Not Offered') is -0.26 percentage points.

The question asks for the assumptions under which this difference is a valid estimate of the causal impact of **receiving Vitamin A** on mortality. However, the independent variable used here is not **received Vitamin A** (treatment) but **offered Vitamin A** (instrument). Therefore, unless every mother who was offered the shots accepted them, this difference is the effect of the offer and not the causal effect of the treatment itself.

More generally, for the effect to be causal, we have to assume that there are no systematic differences between the two groups (treatment and control). In other words, who was offered Vitamin A must have been decided by random assignment, so that confounding variables (observed or unobserved) are mitigated as much as possible.

In [31]:
Y = df_sommer[['outcome']]
Z = df_sommer[['instrument']]
Y = Y.astype(float)
Z = Z.astype(float)
Z = sm.add_constant(Z)
model1 = sm.OLS(Y,Z)
results1 = model1.fit()

In [32]:
print(results1.summary())

                            OLS Regression Results                            
Dep. Variable:                outcome   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     7.830
Date:                Fri, 08 Apr 2022   Prob (F-statistic):            0.00514
Time:                        09:30:57   Log-Likelihood:                 29040.
No. Observations:               23682   AIC:                        -5.808e+04
Df Residuals:                   23680   BIC:                        -5.806e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0064      0.001      9.683      0.0

In [33]:
results1.params

const         0.006386
instrument   -0.002582
dtype: float64
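
The `instrument` coefficient above equals the difference between the two group means reported earlier (0.003804 - 0.006386 = -0.002582), and the constant equals the mean for the non-offered group. This is a general property of OLS with a single binary regressor. The sketch below illustrates it on toy data with a hand-written slope formula; the helper `ols_slope_and_intercept` and the toy lists are hypothetical and are not taken from `sommer_deger.csv`.

```python
from statistics import mean


def ols_slope_and_intercept(x, y):
    """Simple-regression OLS: slope = cov(x, y) / var(x)."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    slope = cov_xy / var_x
    return slope, y_bar - slope * x_bar


# Toy data (hypothetical): 1 = offered / died, 0 = not offered / survived.
toy_offer = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
toy_death = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]

slope, intercept = ols_slope_and_intercept(toy_offer, toy_death)
mean_not_offered = mean([y for x, y in zip(toy_offer, toy_death) if x == 0])
mean_offered = mean([y for x, y in zip(toy_offer, toy_death) if x == 1])

# The slope matches the difference in means; the intercept matches the
# mean of the group coded 0.
print(slope, mean_offered - mean_not_offered)
print(intercept, mean_not_offered)
```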

***2. The second data scientist advised that one should compare the survival rates of babies who received Vitamin A shots to babies who did not receive Vitamin A shots.***                     
a. What percent of babies who received Vitamin A shots died? [3 pts]             
b. What percent of babies who did not receive Vitamin A shots died? [3 pts]                
c. What is the difference in mortality, and under what assumptions is the difference between these two percentages a valid estimate of the causal impact of receiving vitamin A shots on survival? [4 pts]


In [34]:
df_prop_diff2= df_sommer.groupby('treatment')[['outcome']].mean().T
df_prop_diff2.head()

treatment,0,1
outcome,0.00771,0.00124


- a. 0.124% of babies whose mothers gave Vitamin A shots to their babies died.
- b. 0.771% of babies whose mothers did not give Vitamin A shots to their babies died.
- c. Therefore, the difference in mortality rate ('Received' - 'Not Received') is -0.65 percentage points.

For this difference to be the causal effect of receiving Vitamin A on mortality, we have to assume that there are no systematic differences between the two groups (those who received Vitamin A and those who did not). In other words, who received Vitamin A would have to be decided by random assignment, so that confounding variables (observed or unobserved) are mitigated as much as possible.

However, we cannot be sure that mothers who gave their babies the vitamin are identical in all respects (observed or unobserved) to those who did not. A mother not giving her baby the treatment may be a non-random event influenced by other, possibly unobserved, factors, so we cannot treat this difference as the causal effect.

In [35]:
Y = df_sommer[['outcome']]
X = df_sommer[['treatment']]
Y = Y.astype(float)
X = X.astype(float)
X = sm.add_constant(X)
model2 = sm.OLS(Y,X)
results2 = model2.fit()

In [36]:
print(results2.summary())

                            OLS Regression Results                            
Dep. Variable:                outcome   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     47.61
Date:                Fri, 08 Apr 2022   Prob (F-statistic):           5.34e-12
Time:                        09:30:57   Log-Likelihood:                 29060.
No. Observations:               23682   AIC:                        -5.812e+04
Df Residuals:                   23680   BIC:                        -5.810e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0077      0.001     12.864      0.0

In [37]:
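# Note: this cell re-prints results1 (the instrument regression from In [33]);
# use results2.params to inspect the treatment regression fitted in In [35].
results1.params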

const         0.006386
instrument   -0.002582
dtype: float64

***3. The third data scientist advised that one should consider only babies whose mothers were offered Vitamin A shots, and compare babies who received shots to babies who did not receive shots.***
a. What percent of babies who received Vitamin A shots died? [3 pts]                            
b. What percent of babies whose mothers were offered Vitamin A shots, but the mothers did not accept them, died? [3 pts]        
c. What is the difference in mortality, and under what assumptions is the difference between these two percentages a valid estimate of the causal impact of receiving vitamin A shots on survival? [4 pts]                           



In [38]:
df_prop_diff3= df_sommer[df_sommer['instrument']==1].groupby('treatment')[['outcome']].mean().T
df_prop_diff3.head()

treatment,0,1
outcome,0.014055,0.00124


***Among the mothers who were offered Vitamin A shots for their babies,***

- a. 0.124% of babies whose mothers gave Vitamin A shots to their babies died.
- b. 1.41% of babies whose mothers did not give Vitamin A shots to their babies died.
- c. Therefore, the difference in mortality rate ('Received' - 'Not Received') is -1.28 percentage points.

For this difference to be the causal effect of receiving Vitamin A on mortality, we have to assume that there are no systematic differences between the two groups (offered mothers who accepted the shots and offered mothers who declined). In other words, who received Vitamin A would have to be as good as randomly assigned, so that confounding variables (observed or unobserved) are mitigated as much as possible.

However, we cannot be sure that mothers who gave their babies the vitamin are identical in all respects (observed or unobserved) to those who did not. Declining the treatment may be a non-random event influenced by other, possibly unobserved, factors, so we cannot treat this difference as the causal effect.

**4. The fourth data scientist suggested the following Wald estimator for the effect of Vitamin A shots on mortality:**     
a. Compute the above Wald estimate for the given dataset. [2 pts]                       
b. Under what assumptions is this estimate a valid estimate of the causal impact of vitamin A shots on survival? [4 pts]        
c. What is the standard error for the intent-to-treat estimate recommended by the first data scientist? What is the standard error for the Wald estimate recommended by the fourth data scientist? [5 pts]                      
i. Which one is larger and why? [4 pts]                  
ii. Why might these standard errors be biased? What information would you ideally want to have to address this bias? [5 pts]  

In [39]:
from linearmodels.iv import IV2SLS

In [40]:
formula = 'outcome ~ 1 + [treatment ~ instrument]'
IVmodel = IV2SLS.from_formula(formula, df_sommer).fit(cov_type='unadjusted')

In [41]:
IVmodel.summary

0,1,2,3
Dep. Variable:,outcome,R-squared:,0.0015
Estimator:,IV-2SLS,Adj. R-squared:,0.0015
No. Observations:,23682,F-statistic:,7.8396
Date:,"Fri, Apr 08 2022",P-value (F-stat),0.0051
Time:,09:30:57,Distribution:,chi2(1)
Cov. Estimator:,unadjusted,,
,,,

0,1,2,3,4,5,6
,Parameter,Std. Err.,T-stat,P-value,Lower CI,Upper CI
Intercept,0.0064,0.0007,9.6889,0.0000,0.0051,0.0077
treatment,-0.0032,0.0012,-2.7999,0.0051,-0.0055,-0.0010


a. The Wald estimate obtained from the above 2SLS regression is -0.0032.

b. This is a valid estimate of the causal impact under the following assumptions:

1. The intent to treat (the offer of Vitamin A) is uncorrelated with any confounding variable, i.e. it is exogenous. This holds here because the mothers/babies to be offered treatment were chosen at random.
2. The intent to treat (the offer of Vitamin A) is not the same as the treatment itself: the offer influences whether the shot is received but does not completely determine it.

c. The standard error of the intent-to-treat estimate is 0.001; the standard error of the Wald estimate is 0.0012.

- The standard error of the Wald estimate is larger because the Wald estimate divides the intent-to-treat estimate by the difference in treatment rates between the offered and non-offered groups. That denominator is below 1, so both the estimate and its standard error are scaled up.
- These standard errors might be biased because we do not know the exact distributions of the variables.
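
The formula referred to in the question is not reproduced in this notebook. Assuming it is the standard Wald estimator, it is the intent-to-treat difference in outcomes divided by the difference in treatment rates between the offered and non-offered groups. A minimal sketch on toy data follows; the function `wald_estimate` and the toy records are hypothetical and are not drawn from `sommer_deger.csv`.

```python
from statistics import mean


def wald_estimate(records):
    """records: list of (offered, received, died) tuples with 0/1 entries.

    Returns (wald, itt, first_stage), where
      itt         = mean(died | offered) - mean(died | not offered)
      first_stage = mean(received | offered) - mean(received | not offered)
      wald        = itt / first_stage
    """
    offered = [r for r in records if r[0] == 1]
    not_offered = [r for r in records if r[0] == 0]
    itt = mean([r[2] for r in offered]) - mean([r[2] for r in not_offered])
    first_stage = mean([r[1] for r in offered]) - mean([r[1] for r in not_offered])
    return itt / first_stage, itt, first_stage


# Toy data (hypothetical): nobody in the non-offered group receives the shot,
# and three of the four offered units accept it.
toy_records = [
    (0, 0, 1), (0, 0, 1), (0, 0, 0), (0, 0, 0),
    (1, 1, 0), (1, 1, 0), (1, 1, 0), (1, 0, 1),
]

wald, itt, first_stage = wald_estimate(toy_records)
print(f"ITT = {itt:.4f}, first stage = {first_stage:.4f}, Wald = {wald:.4f}")
```

For the notebook's data, the intent-to-treat estimate of -0.002582 and the 2SLS coefficient of -0.0032 together imply a first-stage difference of roughly 0.8, allowing for rounding in the reported coefficient.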

# Exercise 3: Causal Inference in Observational Studies

**Read the paper “The Design Versus the Analysis of Observational Studies for Causal Effects: Parallels With the Design of Randomized Trials” by Donald Rubin.**              
Write a short reflection paragraph up to half a page with your comments. You can use bullet
points to write your comments. Comments can be 1) points that resonated most with you, 2)
points of disagreement, 3) a comparison to what you’ve learned in previous classes, or 4)
anything else you want to comment on as long as they demonstrate that you read and thought
about the paper. [10 pts]

Below are some of the ideas that really resonated with me, along with additional comments where applicable.

- One idea that really resonated with me is Donald Rubin's emphasis that **randomized experiments and observational studies are less a dichotomy than a continuum, depending on how well suited the situation is for causal inference**. In other words, a poorly designed or low-quality randomized experiment could provide less accurate causal inference than a very well designed "hidden experiment" found in observational data.

- Another vital point is that **'Design trumps Analysis'**. As an example, he mentions that when designing the observational study groups, it is important not to use outcome data during the design phase (**'No outcome data in sight'**) so that the integrity of the design is maintained.

- The paper also provides good examples of what a **badly designed causal study from observational data looks like**. The author gives the example of the **U.S. tobacco litigation** and shows that the treatment (smokers) and control (non-smokers) groups, which were used to argue for a differential health-care impact and cost, **are 'not equivalent' or 'like' each other in terms of the distribution of background variables**. Through a careful study of the covariates used (limited to those the court legally allowed), and of the subclassification and matching based on them, he argues that the effects ultimately studied cannot be deemed causal.

- Finally, the paper gives a very **good outline of how to perform the 'assignment' phase of the RCM (Rubin Causal Model)**: how to define the units, treatments and covariates, how to consider the SUTVA assumptions, and an array of other ideas that need to be concretely verified before making causal analysis on observational data.

