## __Data Analytics Recap__
####   001 - Yumi Jin

### __E1. Game Fun: Customer Acquisition through Digital Advertising__

#### 1. Before evaluating the effect of an experiment, it is important to make sure that the experiment was executed correctly. Check whether the test and control groups are probabilistically equivalent on their observables.

a. More specifically, compare the averages of the income, gender and gamer variables in the test and control groups. You should also report the % difference in the averages. Compute its statistical significance. [2 pts]

In [28]:
import pandas as pd
import numpy as np
import openpyxl
from scipy.stats import ttest_ind

In [19]:
path = "/Users/yumi/ucdavis/Spring Quarter/BAX-423/Homework/Homework 1/GameFun.xlsx"
data = pd.read_excel(path)

In [20]:
data.head()

Unnamed: 0,id,test,purchase,site,impressions,income,gender,gamer
0,1956,0,0,site1,0,100,1,0
1,45821,1,0,site1,20,70,1,0
2,59690,1,0,site1,22,100,1,0
3,18851,0,0,site1,13,90,1,0
4,60647,1,0,site1,12,60,1,0


In [24]:
# subset the dataset to test and control groups
test = data[data['test'] == 1]
control = data[data['test'] == 0]

In [39]:
# calculate means for income, gender, and gamer in both groups
mean_test = test[['income', 'gender', 'gamer']].mean()
mean_control = control[['income', 'gender', 'gamer']].mean()

# print('Test:\n', mean_test)
# print('Control\n', mean_control)

__Test:__  
Avg(Income) = 54.938236  
Avg(Gender) = 0.647289  
Avg(Gamer) = 0.601331    
__Control:__  
Avg(Income) = 55.166012  
Avg(Gender) = 0.647905  
Avg(Gamer) = 0.601823

In [54]:
# calculate percentage difference in means
percent_diff = (mean_test - mean_control) / mean_control * 100

# perform t-tests for statistical significance
t_test_income = ttest_ind(test['income'], control['income'])
t_test_gender = ttest_ind(test['gender'], control['gender'])
t_test_gamer = ttest_ind(test['gamer'], control['gamer'])

# print(percent_diff,'\n\n', t_test_income,'\n', t_test_gender,'\n', t_test_gamer)

__Percentage Difference:__  
Income: -0.41%  
Gender: -0.10%  
Gamer: -0.08%  

__Statistical Significance:__  
Income: p-value = 0.128 (not significant)  
Gender: p-value = 0.906 (not significant)  
Gamer: p-value = 0.927 (not significant)  

__Answer:__ The percentage differences in the means of income, gender, and gamer status between the test and control groups are all below 0.5% in absolute value, and none is statistically significant at the 5% level (all p-values exceed 0.12). The test and control groups therefore appear probabilistically equivalent on these observables.
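The three separate t-tests above can also be collected into one balance table. The sketch below is only illustrative: `balance_table` is a hypothetical helper, the data frame is synthetic (it is not the GameFun file), and it uses Welch's t-test (`equal_var=False`) rather than the pooled-variance test used in the cell above.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def balance_table(df, group_col, covariates):
    """Compare covariate means between test (1) and control (0) rows."""
    test = df[df[group_col] == 1]
    control = df[df[group_col] == 0]
    rows = []
    for col in covariates:
        mean_t, mean_c = test[col].mean(), control[col].mean()
        # Welch's t-test: does not assume equal variances in the two groups
        result = ttest_ind(test[col], control[col], equal_var=False)
        rows.append({
            "variable": col,
            "mean_test": mean_t,
            "mean_control": mean_c,
            "pct_diff": (mean_t - mean_c) / mean_c * 100,
            "p_value": result.pvalue,
        })
    return pd.DataFrame(rows).set_index("variable")


# Synthetic stand-in for the experiment data (values are made up for illustration)
rng = np.random.default_rng(0)
n = 10_000
demo = pd.DataFrame({
    "test": rng.integers(0, 2, n),
    "income": rng.normal(55, 20, n),
    "gender": rng.integers(0, 2, n),
    "gamer": rng.integers(0, 2, n),
})
balance = balance_table(demo, "test", ["income", "gender", "gamer"])
print(balance.round(3))
```

Keeping the means, the percentage difference, and the p-value in one row per variable makes it easy to add further covariates to the check.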

<hr style="border:0.5px solid gray">

b. Briefly comment on what these metrics tell you about probabilistic equivalence for this experiment. [2 pts]

__Answer:__ These metrics suggest that the test and control groups are probabilistically equivalent regarding the variables examined. This equivalence is critical because it implies that the two groups were likely subject to similar conditions except for the intervention being tested. Thus, any differences in outcomes between the test and control groups can more confidently be attributed to the experimental intervention rather than differences in baseline characteristics. 

<hr style="border:0.5px solid gray">

c. If you had run this type of analysis BEFORE executing an experiment and found a large difference between test and control groups, what should you do? [5 pts] 

__Answer:__ If, before executing an experiment, the preliminary analysis reveals significant differences between the test and control groups in key characteristics, I would consider the following steps:  
i. Reassess the randomization process: check the assignment procedure for bugs or any non-random selection into groups.  
ii. Rebalance the groups, for example by re-randomizing or by using matched (A/B with Matching) or stratified assignment. If balance cannot be achieved, consider alternative designs such as DiD or Phased Roll Outs.  
iii. Use statistical techniques such as ANCOVA to adjust for the differences in the variables.  
iv. Increase the sample size so that chance imbalances between the groups shrink.  
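One way to act on the rebalancing step is to redraw the assignment until a balance check passes. The sketch below assumes a simple acceptance rule (no covariate differs at p < 0.1), a 50/50 split, and synthetic covariates; `rerandomize`, the threshold, and the data are all hypothetical. An analysis that follows such a procedure should take the acceptance rule into account.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def rerandomize(df, covariates, p_threshold=0.1, max_draws=1000, seed=0):
    """Redraw a 50/50 assignment until no covariate differs at p < p_threshold."""
    rng = np.random.default_rng(seed)
    for draw in range(1, max_draws + 1):
        # Shuffled vector with half zeros (control) and half ones (test)
        assign = rng.permutation(len(df)) % 2
        p_values = [
            ttest_ind(df.loc[assign == 1, c], df.loc[assign == 0, c]).pvalue
            for c in covariates
        ]
        if min(p_values) >= p_threshold:
            return assign, draw
    raise RuntimeError("no balanced assignment found")


# Synthetic pool of customers awaiting assignment (illustrative values only)
rng = np.random.default_rng(42)
pool = pd.DataFrame({
    "income": rng.normal(55, 20, 2000),
    "gamer": rng.integers(0, 2, 2000),
})
assignment, n_draws = rerandomize(pool, ["income", "gamer"])
print("accepted after", n_draws, "draw(s)")
```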


<hr style="border:0.5px solid gray">

d. (Open-Ended Question) If you had millions of consumers, your “classic” statistical significance tests would not work (this is because the number of samples is used to compute those classic statistical tests). Do some research online and propose which significance test you would use in case you had “big data”. [5 pts]

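__Answer:__ With millions of observations, even trivially small differences become statistically significant, so the approaches below shift the focus from p-values alone to the size and stability of the effect:  
i. Effect Size Measures: Report the magnitude of the difference (for example, a standardized mean difference) and judge whether it is practically meaningful.  
ii. Resampling Methods: Techniques like bootstrapping give confidence intervals for the difference without relying on a closed-form test statistic.  
iii. Bayesian Methods: These incorporate prior knowledge and adapt well to complex data.  
iv. Data Splitting: Analyze random subsets of the data and check whether a difference is consistent across them, to avoid flagging differences that are statistically significant but practically negligible.  
v. Adjusted Statistical Techniques: Such as the corrected resampled t-test, which reduces Type I errors prevalent in large datasets.  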
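The first two approaches can be sketched together: a standardized mean difference (Cohen's d) as the effect size, and a percentile bootstrap interval for the raw difference in means. The data below are synthetic, and the function names, sample sizes, and number of bootstrap draws are assumptions chosen for illustration.

```python
import numpy as np


def standardized_mean_diff(x_test, x_control):
    """Difference in means divided by the pooled standard deviation (Cohen's d)."""
    n_t, n_c = len(x_test), len(x_control)
    pooled_var = (
        (n_t - 1) * np.var(x_test, ddof=1) + (n_c - 1) * np.var(x_control, ddof=1)
    ) / (n_t + n_c - 2)
    return (np.mean(x_test) - np.mean(x_control)) / np.sqrt(pooled_var)


def bootstrap_ci(x_test, x_control, n_boot=300, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the difference in means."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each group with replacement and recompute the difference
        diffs[b] = (rng.choice(x_test, size=len(x_test)).mean()
                    - rng.choice(x_control, size=len(x_control)).mean())
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])


# Synthetic incomes with a tiny true difference (illustrative values only)
rng = np.random.default_rng(7)
income_test = rng.normal(55.0, 20.0, 50_000)
income_control = rng.normal(55.3, 20.0, 50_000)

d = standardized_mean_diff(income_test, income_control)
ci = bootstrap_ci(income_test, income_control)
print("Cohen's d:", round(d, 4))
print("95% bootstrap CI for the difference in means:", ci.round(3))
```

With samples this large, a difference can be detectable and still correspond to a negligible effect size, which is the reason for reporting the effect size next to any test result.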


---

#### 2. Evaluate the average purchase rates in the test and control for the following groups. For each comparison, report the average purchase rate for the test, average purchase rate for the control and the absolute difference (not the % difference) between the test and control.

In [56]:
# group data and calculate average purchase rates for each segment

# a. All customers
all_customers = data.groupby('test')['purchase'].mean()

# b. Male vs. Female customers
gender_purchase = data.groupby(['test', 'gender'])['purchase'].mean().unstack()

# c. Gamers vs. Non-Gamers Customers
gamer_purchase = data.groupby(['test', 'gamer'])['purchase'].mean().unstack()

# d. Female Gamers vs. Male Gamers
female_gamers = data[(data['gender'] == 0) & (data['gamer'] == 1)].groupby('test')['purchase'].mean()
male_gamers = data[(data['gender'] == 1) & (data['gamer'] == 1)].groupby('test')['purchase'].mean()

# print results
# all_customers, gender_purchase, gamer_purchase, female_gamers, male_gamers

__Comparison 1: All Customers__  
Control Group: 3.62%  
Test Group: 7.68%  
Absolute Difference: 7.68% − 3.62% = 4.06%  
 
 --
 
__Comparison 2: Male vs. Female Customers__  
Control Group:  
Female: 3.44%  
Male: 3.72%  

Test Group:  
Female: 8.09%  
Male: 7.46%  

Absolute Differences:  
Female: 8.09% − 3.44% = 4.65%  
Male: 7.46% − 3.72% = 3.74%  

--

__Comparison 3: Gamers vs. Non-Gamers Customers__  
Control Group:  
Non-Gamers: 3.74%  
Gamers: 3.54%  

Test Group:  
Non-Gamers: 3.51%  
Gamers: 10.45%  

Absolute Differences:  
Non-Gamers: 3.51% − 3.74% = −0.23%  
Gamers: 10.45% − 3.54% = 6.91%  

--

__Comparison 4: Female Gamers vs. Male Gamers__  
Control Group:  
Female Gamers: 3.20%  
Male Gamers: 3.73%  

Test Group:  
Female Gamers: 11.01%   
Male Gamers: 10.14%   

Absolute Differences:    
Female Gamers: 11.01% − 3.20% = 7.81%  
Male Gamers: 10.14% − 3.73% = 6.41%
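The control rate, test rate, and absolute difference for any segment definition can be produced in one table. This is a minimal sketch on a tiny hand-made data frame; `lift_table` is a hypothetical helper and the eight rows are invented, so the numbers do not correspond to the GameFun results above.

```python
import pandas as pd


def lift_table(df, segment_cols):
    """Purchase rate (%) by segment for control and test, plus the absolute difference."""
    rates = df.groupby(segment_cols + ["test"])["purchase"].mean().unstack("test")
    rates = rates.rename(columns={0: "control", 1: "test"})
    rates["abs_diff"] = rates["test"] - rates["control"]
    return rates * 100  # rates in percent, abs_diff in percentage points


# Tiny invented example: 4 control rows and 4 test rows
demo = pd.DataFrame({
    "test":     [0, 0, 0, 0, 1, 1, 1, 1],
    "gamer":    [0, 0, 1, 1, 0, 0, 1, 1],
    "purchase": [0, 1, 0, 0, 0, 1, 1, 1],
})
table = lift_table(demo, ["gamer"])
print(table)
```

Passing several columns, for example `["gender", "gamer"]`, would give the cross-segment comparison in the same layout.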


---

#### 3. Assess the expected revenue in the test vs. control for the following comparisons:

In [64]:
# group data and calculate expected revenue for each segment

# a. All customers
all_customers = data.groupby('test')['purchase'].mean() * 37.5

# b. Female Gamers vs. Male Gamers
female_gamers = data[(data['gender'] == 0) & (data['gamer'] == 1)].groupby('test')['purchase'].mean() * 37.5
male_gamers = data[(data['gender'] == 1) & (data['gamer'] == 1)].groupby('test')['purchase'].mean() * 37.5

# print results
# all_customers, female_gamers, male_gamers

__Comparison 1: All Customers__   
Control Group Avg. Revenue: 1.357991  
Test Group Avg. Revenue: 2.880816   

--

__Comparison 2: Female Gamers vs. Male Gamers__  
Female Gamers:  
Control Group Avg. Revenue: 1.201543  
Test Group Avg. Revenue: 4.128440  

Male Gamers:  
Control Group Avg. Revenue: 1.397815  
Test Group Avg. Revenue: 3.802652  
 

---

#### 4. Based on your previous answers, provide a brief recommendation to your management team summarizing the expected financial outcome for Game-Fun.
a. Should Game-Fun run this promotion again in the future? If no, explain why. If yes, should Game-Fun offer it to all customers or a targeted segment. [10 pts]

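__Answer:__ The test group showed a higher purchase rate than the control group overall (7.68% vs. 3.62%) and in every segment except non-gamers, whose purchase rate was essentially unchanged (3.51% vs. 3.74%). Expected revenue per customer, computed at 37.5 per purchase, was correspondingly higher in the test group for all customers and for both female and male gamers. Thus, it is advisable for Game-Fun to run this promotion again.  

Instead of offering the promotion to all customers, a more targeted approach should be adopted. The focus should be on gamers, especially female gamers, who showed the largest uplift in purchase rate (7.81 percentage points, versus 6.41 for male gamers), while non-gamers showed no uplift.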

---

### __E2. Non-Compliance in Randomized Experiments__

#### 1. The first data scientist advised that one should compare the survival rate of babies whose mothers were offered Vitamin A shots to the survival rate of babies whose mothers were not offered a Vitamin A shot.
a. What percent of babies whose mothers were offered Vitamin A shots for their babies died? [3 pts]


In [67]:
# read the file
path = "/Users/yumi/ucdavis/Spring Quarter/BAX-423/Homework/Homework 1/sommer_deger.csv"
data = pd.read_csv(path)

In [68]:
data.head()

Unnamed: 0,instrument,treatment,outcome
0,0,0,0
1,0,0,0
2,0,0,0
3,0,0,0
4,0,0,0


instrument (equals one if the mother was offered a vitamin A shot for her baby)  
treatment (equals one if the baby got the vitamin A shot)  
outcome (equals one if the baby did not survive)  

In [75]:
# subset the data based on whether the mothers were offered Vitamin A shots or not
offered_vitamin_a = data[data['instrument'] == 1]
not_offered_vitamin_a = data[data['instrument'] == 0]

mortality_offered = offered_vitamin_a['outcome'].mean() * 100
print('Answer: ', round(mortality_offered,2), 'percent of babies whose mothers were offered Vitamin A shots for their babies died.')

Answer:  0.38 percent of babies whose mothers were offered Vitamin A shots for their babies died.


<hr style="border:0.5px solid gray">

b. What percent of babies whose mothers were not offered Vitamin A shots for their babies died? [3 pts]


In [76]:
mortality_not_offered = not_offered_vitamin_a['outcome'].mean() * 100
print('Answer: ', round(mortality_not_offered,2), 'percent of babies whose mothers were not offered Vitamin A shots for their babies died.')

Answer:  0.64 percent of babies whose mothers were not offered Vitamin A shots for their babies died.


<hr style="border:0.5px solid gray">

c. What is the difference in mortality, and under what assumptions is the difference between these two percentages a valid estimate of the causal impact of receiving vitamin A shots on survival? [4 pts]

In [81]:
mortality_difference = mortality_not_offered - mortality_offered
print('Answer: The difference in mortality between these two groups is ', round(mortality_difference,2), '%.')

Answer: The difference in mortality between these two groups is  0.26 %.


To consider the difference in mortality as a valid estimate of the causal impact of receiving Vitamin A shots on survival, we need to assume:  

- __Randomization:__ The decision to offer Vitamin A shots must be random among the population studied, ensuring that there are no confounding variables influencing both the likelihood of receiving the treatment and the outcome.  

- __Compliance:__ The data should ideally reflect perfect compliance, meaning everyone who was offered a shot received it and no one who was not offered a shot received one.  

- __No Spillover Effects:__ There should be no influence of treated individuals on untreated ones, and no external factors influencing the outcomes other than the treatment.   

- __No Measurement Errors:__ Outcomes and treatment status should be accurately measured and recorded.  

Under these conditions, the calculated difference can be interpreted as an estimate of the causal effect of the intervention (offering Vitamin A shots) on survival rates. Without these conditions being met, the observed differences might be attributable to other factors not controlled for in the analysis.  

---

#### 2. The second data scientist advised that one should compare the survival rates of babies who received Vitamin A shots to babies who did not receive Vitamin A shots.

a. What percent of babies who received Vitamin A shots died? [3pts]  

In [82]:
# subset the data based on whether the babies received Vitamin A shots or not
received_vitamin_a = data[data['treatment'] == 1]
not_received_vitamin_a = data[data['treatment'] == 0]

mortality_received = received_vitamin_a['outcome'].mean() * 100
print('Answer: ', round(mortality_received,2), 'percent of babies who received Vitamin A shots died.')

Answer:  0.12 percent of babies who received Vitamin A shots died.


<hr style="border:0.5px solid gray">

b. What percent of babies who did not receive Vitamin A shots died? [3pts]   

In [83]:
mortality_not_received = not_received_vitamin_a['outcome'].mean() * 100
print('Answer: ', round(mortality_not_received,2), 'percent of babies who did not receive Vitamin A shots died.')

Answer:  0.77 percent of babies who did not receive Vitamin A shots died.


<hr style="border:0.5px solid gray">

c. What is the difference in mortality, and under what assumptions is the difference between these two percentages a valid estimate of the causal impact of receiving vitamin A shots on survival? [4 pts] 

In [84]:
mortality_difference = mortality_not_received - mortality_received
print('Answer: The difference in mortality between these two groups is ', round(mortality_difference,2), '%.')

Answer: The difference in mortality between these two groups is  0.65 %.


The difference between these two percentages can be considered a valid estimate of the causal impact of receiving Vitamin A shots on survival under specific assumptions:  

- __Random Assignment:__ If the treatment (Vitamin A shots) was randomly assigned to the babies, then the treatment and control groups would likely be similar in all aspects except for the intervention. This helps in attributing any differences in outcomes directly to the treatment.  
- __No Unmeasured Confounders:__ There should be no unmeasured variables that could influence both the likelihood of receiving the treatment and the survival outcome.    
- __Sufficient Sample Size:__ The sample should be large enough that the observed difference is unlikely to be due to random chance.  
- __No Spillover Effects:__ The treatment on one baby should not affect the outcomes of another baby who did not receive the treatment.  

If these conditions are met, then the observed difference in mortality rates can be interpreted as the causal effect of the Vitamin A shots on survival.

---

#### 3. The third data scientist advised that one should consider only babies whose mothers were offered Vitamin A shots, and compare babies who received shots to babies who did not receive shots.
a. What percent of babies who received Vitamin A shots died? [3pts]

In [90]:
offered_vitamin_a = data[data['instrument'] == 1]

mortality_offered = offered_vitamin_a.groupby('treatment')['outcome'].mean() * 100

print('Answer: ', round(mortality_offered[1],2), 'percent of babies who received Vitamin A shots died.')

Answer:  0.12 percent of babies who received Vitamin A shots died.


<hr style="border:0.5px solid gray">

b. What percent of babies whose mothers were offered Vitamin A shots, but the mothers did not accept them, died? [3 pts]


In [91]:
mortality_offered = offered_vitamin_a.groupby('treatment')['outcome'].mean() * 100

print('Answer: ', round(mortality_offered[0],2), 'percent of babies whose mothers were offered Vitamin A shots, but the mothers did not accept them, died.')

Answer:  1.41 percent of babies whose mothers were offered Vitamin A shots, but the mothers did not accept them, died.


<hr style="border:0.5px solid gray">

c. What is the difference in mortality, and under what assumptions is the difference between these two percentages a valid estimate of the causal impact of receiving vitamin A shots on survival? [4 pts]

In [94]:
mortality_difference_offered = mortality_offered[0] - mortality_offered[1]
print('Answer: The difference in mortality between these two groups is ', round(mortality_difference_offered,2), '%.')

Answer: The difference in mortality between these two groups is  1.28 %.


The difference between these two percentages can be considered a valid estimate of the causal impact of receiving Vitamin A shots on survival under specific assumptions:  

- __Random Compliance:__ It's assumed that the decision to accept the shot (after being offered) is as good as random among those offered. This minimizes selection bias, where only certain types of individuals accept or refuse treatment.  
- __No Unmeasured Confounders:__ This assumes there are no unmeasured variables that could influence both the likelihood of accepting the treatment and the survival outcome, particularly within the subgroup whose mothers were offered the shots.  
-  __Comparable Groups:__ It assumes that the only systematic difference between the two groups (those who accepted and those who did not) is the treatment itself. This means that any other potential differences are random and not confounded by the treatment.  
- __No Spillover Effects:__ Similar to before, it's assumed that the treatment decision for one baby does not influence the outcomes of another baby.  

If these conditions hold, the observed difference in mortality rates can provide a reliable estimate of the causal effect of Vitamin A shots on survival among those offered the treatment. 

---

#### 4. The fourth data scientist suggested the following Wald estimator for the effect of Vitamin A shots on mortality: 
$$
\text{Wald} = \frac{\%\ \text{of babies offered shots that died} \;-\; \%\ \text{of babies not offered shots that died}}{\%\ \text{of babies who were offered a shot and received it}}
$$

a. Compute the above Wald estimate for the given dataset. [2pts]

In [118]:
# calculate the necessary percentages for the Wald Estimator
offered_died = data[data['instrument'] == 1]['outcome'].mean() * 100
not_offered_died = data[data['instrument'] == 0]['outcome'].mean() * 100
offered_and_received = (data[(data['instrument'] == 1) & (data['treatment'] == 1)].shape[0] / 
                        data[data['instrument'] == 1].shape[0]) * 100

# calculate the Wald Estimator
# (a ratio of two percentages, so the result is a proportion, not a percent)
wald_estimator = (offered_died - not_offered_died) / offered_and_received

print('Answer: The Wald estimate for the given dataset is', round(wald_estimator, 5))

Answer: The Wald estimate for the given dataset is -0.00323


<hr style="border:0.5px solid gray">

b. Under what assumptions is this estimate a valid estimate of the causal impact of vitamin A shots on survival? [4 pts]

For the Wald estimator to be a valid estimate of the causal impact of vitamin A shots on survival, the following assumptions must hold:  

- __Exogeneity of the Instrument:__ The instrument (being offered a shot) must be uncorrelated with any potential confounders that affect both the likelihood of receiving the treatment and the outcome.   
- __Exclusion Restriction:__ The instrument affects the outcome only through its effect on the treatment. This means that being offered a shot should not affect the baby's survival directly, but only indirectly by influencing whether the baby receives the shot.     
- __Non-zero Average Causal Effect of the Instrument on the Treatment:__ The instrument must have a substantial impact on the probability of receiving the treatment to ensure a robust estimate.   
- __Monotonicity:__ The instrument should not affect the decision to receive the treatment in opposite directions for different individuals. Essentially, offering a shot should not lead anyone who would have otherwise accepted the treatment to refuse it.
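A small simulation can illustrate what these assumptions buy. The sketch below is built on invented numbers: a randomized offer, 80% of mothers accepting when offered, nobody treated without an offer, a higher baseline mortality among those who would decline, and a true effect of -0.004 on mortality for treated babies. Under this setup the Wald ratio should land near the true effect, while a direct treated-versus-untreated comparison should overstate it.

```python
import numpy as np


def wald_estimate(z, d, y):
    """ITT effect on the outcome divided by the ITT effect on treatment take-up."""
    itt_outcome = y[z == 1].mean() - y[z == 0].mean()
    itt_takeup = d[z == 1].mean() - d[z == 0].mean()
    return itt_outcome / itt_takeup


rng = np.random.default_rng(1)
n = 400_000
z = rng.integers(0, 2, n)          # randomized offer (the instrument)
complier = rng.random(n) < 0.8     # assumed: 80% accept the shot when offered
d = z * complier                   # assumed: no one is treated without an offer

# Assumed: decliners have higher baseline mortality; the shot lowers mortality by 0.004
baseline = np.where(complier, 0.006, 0.014)
y = (rng.random(n) < baseline - 0.004 * d).astype(int)

wald = wald_estimate(z, d, y)
as_treated = y[d == 1].mean() - y[d == 0].mean()

print("true effect on treated babies: -0.004")
print("Wald estimate:                ", round(wald, 4))
print("treated vs. untreated:        ", round(as_treated, 4))
```

The treated-versus-untreated gap mixes the effect of the shot with the baseline difference between acceptors and decliners, which is the selection problem the instrument is meant to remove.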


<hr style="border:0.5px solid gray">

c. What is the standard error for the intent-to-treat estimate recommended by the first data scientist? What is the standard error for the Wald estimate recommended by the fourth data scientist? [5 pts]  
i. Which one is larger and why? [4 pts]  
ii. Why might these standard errors be biased? What information would you ideally want to have to address this bias? [5 pts]  

In [129]:
offered_shots_died = data[(data['instrument'] == 1) & (data['outcome'] == 1)].shape[0]
offered_shots_total = data[data['instrument'] == 1].shape[0]
percent_offered_died = (offered_shots_died / offered_shots_total) * 100

not_offered_shots_died = data[(data['instrument'] == 0) & (data['outcome'] == 1)].shape[0]
not_offered_shots_total = data[data['instrument'] == 0].shape[0]
percent_not_offered_died = (not_offered_shots_died / not_offered_shots_total) * 100

# Intent-to-treat standard error
p1 = percent_offered_died / 100
p2 = percent_not_offered_died / 100
se_intent_to_treat = np.sqrt(p1*(1-p1)/offered_shots_total + p2*(1-p2)/not_offered_shots_total) * 100

# calculate the standard error for the Wald estimate:
# the ITT standard error divided by the share of offered babies who received the shot
p_treat = offered_and_received / 100
se_wald = se_intent_to_treat / p_treat  # se_intent_to_treat is already in percentage points

print('Answer: The standard error for the intent-to-treat estimate recommended by the first data scientist is', round(se_intent_to_treat,5), '%.')
print('Answer: The standard error for the Wald estimate recommended by the fourth data scientist is', round(se_wald,3), '%.')

Answer: The standard error for the intent-to-treat estimate recommended by the first data scientist is 0.09278 %.
Answer: The standard error for the Wald estimate recommended by the fourth data scientist is 0.116 %.


i. The standard error for the Wald estimate (about 0.116 %) is larger than that for the intent-to-treat estimate (0.09278 %). The Wald estimator divides the intent-to-treat difference by the share of offered babies who actually received the shot, which is below 1 (roughly 0.8 here), so both the estimate and its standard error are scaled up by the same factor. The lower the compliance, the weaker the instrument and the larger the Wald standard error.  

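ii. These standard errors could be biased for two reasons. First, the Wald standard error computed here treats the compliance rate in the denominator as a known constant, although it is itself estimated from the data, so the reported value can misstate the true uncertainty. Second, both formulas assume independent observations. If shots were offered at a group level (for example by village or clinic) rather than baby by baby, outcomes within a group are correlated and both standard errors would be too small. Separately, incomplete compliance, unmeasured confounders, or an invalid instrument would bias the point estimates themselves.  

Ideally, I would want the identifier of the unit at which offers were assigned, more detailed data on compliance mechanisms (why some mothers declined), and additional variables that could be potential confounders, so that the standard errors can be adjusted and the strength and validity of the instrument can be checked.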
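One way to check a plug-in standard error for the Wald estimate is a row-level bootstrap, which lets the compliance rate vary from resample to resample. The sketch below uses synthetic data with invented rates, and all function names are hypothetical. If offers had been assigned by group, whole groups would need to be resampled instead of individual rows.

```python
import numpy as np


def wald(z, d, y):
    """Difference in outcome rates divided by the difference in take-up rates."""
    return (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())


def plugin_se(z, d, y):
    """ITT standard error, and the same value divided by the take-up difference."""
    p1, p0 = y[z == 1].mean(), y[z == 0].mean()
    n1, n0 = (z == 1).sum(), (z == 0).sum()
    se_itt = np.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    takeup = d[z == 1].mean() - d[z == 0].mean()
    return se_itt, se_itt / takeup


def bootstrap_se(z, d, y, n_boot=300, seed=0):
    """Standard deviation of the Wald estimate across row-level resamples."""
    rng = np.random.default_rng(seed)
    n = len(z)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample rows with replacement
        draws[b] = wald(z[idx], d[idx], y[idx])
    return draws.std(ddof=1)


# Synthetic data (illustrative rates only, not the vitamin A dataset)
rng = np.random.default_rng(3)
n = 100_000
z = rng.integers(0, 2, n)
d = z * (rng.random(n) < 0.8)
y = (rng.random(n) < 0.007 - 0.003 * d).astype(int)

se_itt, se_plugin = plugin_se(z, d, y)
se_boot = bootstrap_se(z, d, y)
print("ITT SE:            ", round(se_itt, 6))
print("Wald SE (plug-in): ", round(se_plugin, 6))
print("Wald SE (bootstrap):", round(se_boot, 6))
```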

---

### __E3: Causal Inference in Observational Studies__

__Answer:__   
1) Resonance with Objective Study Design: Rubin emphasizes designing studies like randomized trials, using pre-treatment data to form groups without seeing outcome data. This method to avoid bias really aligns with the strict standards I learned about in previous statistics courses.    


2) No Points of Disagreement: Rubin's study is compelling to me.   


3) Comparison with Previous Learning on Propensity Scores: Rubin's explanation of propensity scores to ensure group comparability deepens my understanding beyond the theoretical discussions from our classes. His practical critique of common mispractices provides useful real-world insights.    


4) Implications for Research Integrity: The paper emphasizes the ethical dimension of research design, advocating for a clear separation between the design and analysis phases to prevent bias. This lesson on the importance of ethical research practices is a critical takeaway for conducting scientifically valid research.  