In [1]:
version = "REPLACE_PACKAGE_VERSION"

# Experiment Design and Analysis
## School of Information, University of Michigan

## Week 4: 
- 1. Threats to Validity
- 2. Instrumental Variables

## Assignment Overview
### The objective of this assignment is to:

- Apply theory of experiment design and knowledge of analysis techniques to real experiment data.


### The total score of this assignment will be 25 points


### Resources:
- StatsModels
    - We recommend using a python library called [StatsModels](https://www.statsmodels.org/stable/index.html) for data analysis


- Dataset used in this assignment: Kiva Crowdsourcing Team data [view source files](https://www.openicpsr.org/openicpsr/project/100358/version/V2/view)
    - Source for dataset: [Chen, Y., et al. Recommending teams promotes prosocial lending in online microfinance (2016).](https://www.pnas.org/content/113/52/14944)

In [16]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from linearmodels.iv import IV2SLS #you may get a deprecation warning for this library -- this is fine.

data = pd.read_csv('assets/assignment4_data.csv') #Data for this assignment

In [17]:
#the line below prints the readme file for this dataset (includes explanation of variable names)
!cat assets/assignment4_data_readme.md

#uncomment the below line to view snippet of csv file
#data.head()

### Assignment Topic: Data analysis of a field experiment on Kiva

### Background:

The data files come from a field experiment conducted by the University of Michigan on the Kiva platform. They are provided in CSV format so that they can be imported into the Jupyter Notebook.

### Data:

The data file contains eight variables for 64,800 subjects. Below are descriptions of each field in the file:

    - shuffled_lender_id: a de-identified identifier to represent unique subjects.
    
    - treatment_id: the type of treatment the subjects received as part of the experiment.
    
        1. No-Contact
        
        2. Team-Exist
        
        3. Location-Explanation
        
        4. Location-NoExplanation
        
        5. History-Explanation
        
        6. History-NoExplanation
        
        7. Leaderboard-Explanation
        
        8. Leaderboard-NoExplanation
        
    - join: Whether a lender joined a team after the experiment.
    
    - join_rec: Whethe

In [18]:
data.head()

Unnamed: 0,shuffled_lender_id,treatment_id,join,join_rec,opened,amt_diff_1d,amt_diff_7d,amt_diff_30d
0,0,7,0,0,1,0.0,0.0,-50.0
1,1,3,0,0,1,0.0,0.0,0.0
2,2,4,0,0,1,0.0,0.0,0.0
3,3,2,0,0,0,0.0,0.0,0.0
4,4,5,0,0,0,0.0,0.0,0.0


## Part A (15 points)

We want to assess the effectiveness of joining a team on Kiva -- specifically, what impact joining a team has on donations. Using the variable indicating whether users have joined a team (```join```) and the differences in donations made over a certain period (```amt_diff_1d```, ```amt_diff_7d```, ```amt_diff_30d```), we can estimate whether joining a team affects donations. However, we cannot assign subjects to teams; we can only inform them about joining one. The factors that determine whether a subject joins a team may therefore also affect the amount they donate, which would bias a direct comparison of joiners and non-joiners.

In this case, our instrumental variable is whether the subjects were sent an e-mail informing them about the team functionality on Kiva. 

***Before you go on, recall from lecture the requirements an instrumental variable must satisfy - you will be investigating these requirements in this notebook.***
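As a reminder, the two standard textbook requirements can be written compactly. The notation below is an assumption introduced here for illustration (it is not taken from the lecture or the dataset readme): $Z_i$ is the instrument (the e-mail indicator), $D_i$ is the treatment actually taken (joining a team), $Y_i$ is the outcome (the change in lending), and $\varepsilon_i$ is the error term of the outcome equation.

```latex
\begin{aligned}
\text{Outcome equation:}\quad & Y_i = \beta_0 + \beta_1 D_i + \varepsilon_i \\
\text{Relevance:}\quad & \operatorname{Cov}(Z_i, D_i) \neq 0 \\
\text{Exogeneity (exclusion):}\quad & \operatorname{Cov}(Z_i, \varepsilon_i) = 0
\end{aligned}
```

Relevance can be checked in the data through the first-stage regression; exogeneity cannot be tested directly and has to be argued from the design, here the random assignment of e-mails.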

1. To get started, we need to create the instrumental variable. Add a new column named ```email``` in the dataframe. The value for ```email``` should be ```1``` if users received an e-mail as part of their treatment group (```treatment_id``` != 1), and ```0``` if they did not (```treatment_id``` = 1). (2 points)

In [19]:
data.rename(columns={'join': 'join_any'}, inplace=True) #since join is also the name of a pandas method, we rename the column to avoid confusion
# YOUR CODE HERE

data['email'] = np.where(data['treatment_id']!=1, 1, 0)

In [20]:
assert data['email'].isin([0, 1]).all(), "email column must contain either 0 or 1"

In [21]:
assert (data.loc[data['treatment_id'] != 1, 'email'] == 1).all(), "all treatments except treatment 1 received an email"
assert (data.loc[data['treatment_id'] == 1, 'email'] == 0).all(), "treatment 1 did not receive an email"

2. Next, we will create a constant, equal to 1. Add a column in the dataframe called ```const```. (1 point)

In [22]:
# YOUR CODE HERE


data['const'] = 1

In [23]:
assert (data['const'] == 1).all(), "the constant value should be 1"

In lecture, the 2-stage least squares model was used. Now, let’s follow the steps indicated in lecture to create this model to measure the effect described above. First, we need to estimate the effect of e-mailing users to join a team on whether they join a team.


3. Using statsmodels, create an ordinary least squares regression model that does this. Fit the model and store it in the variable: ```model_fs```. Using the predict method from ```model_fs```, store the predicted values in a new column in your dataframe called ```predicted_join```. Recall from lecture that, provided the instrument is valid, these predicted values let us estimate the effect of joining a team on lending amounts without the bias that unobserved or omitted variables would otherwise introduce. (4 points)

Note: ensure your model has a constant.
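To build intuition for what the first stage produces, here is a minimal sketch on synthetic data (not the Kiva data). The join rates of 2% and 10% are made-up assumptions, and plain NumPy least squares stands in for the statsmodels call. With a single binary regressor plus a constant, the fitted values are simply the mean of the outcome within each group.

```python
import numpy as np

# Synthetic stand-in for the assignment data; every number here is made up.
rng = np.random.default_rng(0)
n = 10_000
email = rng.integers(0, 2, size=n)             # hypothetical binary instrument
join_prob = np.where(email == 1, 0.10, 0.02)   # assumed join rates, for illustration only
join_any = (rng.random(n) < join_prob).astype(float)

# First stage: least-squares regression of join_any on a constant and email.
X = np.column_stack([np.ones(n), email])
beta, *_ = np.linalg.lstsq(X, join_any, rcond=None)
predicted_join = X @ beta

# With one binary regressor, the fitted values are just the two group means.
mean_no_email = join_any[email == 0].mean()
mean_email = join_any[email == 1].mean()
print("intercept vs. mean without email:", beta[0], mean_no_email)
print("intercept + slope vs. mean with email:", beta[0] + beta[1], mean_email)
```

This is why ```predicted_join``` can take only two distinct values, one per e-mail group, which is consistent with the emailed rows in the output further below all sharing a single predicted value.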

In [30]:
def email_join_ols(provided_data):
    # First stage: regress join_any on the instrument (email) plus a constant
    Y = provided_data['join_any']
    X = sm.add_constant(provided_data['email'])
    model_fs = sm.OLS(Y, X).fit()

    # Store the fitted values; note that this adds a column to provided_data in place
    provided_data['predicted_join'] = model_fs.predict()

    return provided_data

Your function should return a dataframe with the correct values and columns. Check that it does:

In [31]:
email_join_ols(data).head() #you may get a deprecation warning here -- that is ok.

Unnamed: 0,shuffled_lender_id,treatment_id,join_any,join_rec,opened,amt_diff_1d,amt_diff_7d,amt_diff_30d,email,const,predicted_join
0,0,7,0,0,1,0.0,0.0,-50.0,1,1,0.009801
1,1,3,0,0,1,0.0,0.0,0.0,1,1,0.009801
2,2,4,0,0,1,0.0,0.0,0.0,1,1,0.009801
3,3,2,0,0,0,0.0,0.0,0.0,1,1,0.009801
4,4,5,0,0,0,0.0,0.0,0.0,1,1,0.009801


In [32]:
assert 'predicted_join' in data, "checking there is a column named predicted_join in data"

In [33]:
"""checking the correct predicted_join values are present"""
# Hidden tests

'checking the correct predicted_join values are present'

Now that we have the predicted values of whether a subject would be expected to join a team based on if they were e-mailed, we can move to the second stage.
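With a single binary instrument, the two-stage procedure has a simple closed form: the second-stage slope equals the difference in mean outcomes between the two instrument groups divided by the difference in mean take-up (often called the Wald estimator). The sketch below checks this on synthetic data; the variable names, the take-up rates, and the assumed true effect of 5 are all illustrative assumptions, and NumPy least squares stands in for statsmodels.

```python
import numpy as np

# Synthetic data; all parameters are assumptions chosen for illustration.
rng = np.random.default_rng(1)
n = 20_000
z = rng.integers(0, 2, size=n).astype(float)          # instrument (plays the role of email)
d = (rng.random(n) < 0.05 + 0.10 * z).astype(float)   # take-up (plays the role of join_any)
y = 5.0 * d + rng.normal(size=n)                      # outcome; assumed true effect is 5


def ols(X, target):
    """Least-squares coefficients for target regressed on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef


const = np.ones(n)

# Stage 1: predict take-up from the instrument.
Z = np.column_stack([const, z])
d_hat = Z @ ols(Z, d)

# Stage 2: regress the outcome on the predicted take-up.
slope_2sls = ols(np.column_stack([const, d_hat]), y)[1]

# Wald estimator: ratio of the two differences in group means.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

print("manual 2SLS slope:", slope_2sls)
print("Wald estimate:    ", wald)
```

The two numbers agree up to floating-point error. The ratio form also shows why a small first-stage difference (few people joining because of the e-mail) makes the estimate large and noisy.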

4. In this stage, we will run the estimation of the effect of joining a team on the amount a subject lends. However, instead of using the ```join_any``` variable (the renamed ```join``` column), we will use our new ```predicted_join``` variable. Using statsmodels again, and ensuring your model has a constant, create three ordinary least squares regression models which estimate the effect of the prediction of users joining a team on the following:

a. ```amt_diff_1d```, storing the fitted model in ```model_1d``` (3 points)

In [34]:
def pred_join_amt_1d(provided_data):

    
    data = email_join_ols(provided_data)
    
    Y = data['amt_diff_1d']
    X = data['predicted_join']
    X = sm.add_constant(X)
    model = sm.OLS(Y,X)
    model_1d = model.fit()
    
    
    return model_1d

Your function should return the fitted model. Print its summary to check your results:

In [35]:
print(pred_join_amt_1d(data).summary()) #we've wrapped this in a print statement to preserve the original statsmodels layout. You may get a deprecation warning here -- that is ok.

                            OLS Regression Results                            
Dep. Variable:            amt_diff_1d   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     56.75
Date:                Sun, 21 Nov 2021   Prob (F-statistic):           5.01e-14
Time:                        20:54:27   Log-Likelihood:            -2.8014e+05
No. Observations:               64800   AIC:                         5.603e+05
Df Residuals:                   64798   BIC:                         5.603e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const             -2.6593      0.367     -7.

In [36]:
"""checking your t-value is correct"""
# Hidden tests

'checking your t-value is correct'

b. ```amt_diff_7d```, storing the fitted model in ```model_7d``` (3 points)

In [37]:
def pred_join_amt_7d(provided_data):
    # YOUR CODE HERE
    
    
    data = email_join_ols(provided_data)
    Y = data['amt_diff_7d']
    X = data['predicted_join']
    X = sm.add_constant(X)
    model = sm.OLS(Y,X)
    model_7d = model.fit()
    
    return model_7d

Your function should return the fitted model. Print its summary to check your results:

In [38]:
print(pred_join_amt_7d(data).summary())

                            OLS Regression Results                            
Dep. Variable:            amt_diff_7d   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     9.979
Date:                Sun, 21 Nov 2021   Prob (F-statistic):            0.00158
Time:                        20:57:29   Log-Likelihood:            -3.5400e+05
No. Observations:               64800   AIC:                         7.080e+05
Df Residuals:                   64798   BIC:                         7.080e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const             -6.5511      1.148     -5.

In [39]:
"""checking your t-value is correct"""
# Hidden tests

'checking your t-value is correct'

c. ```amt_diff_30d```, storing the fitted model in ```model_30d``` (2 points)

In [40]:
def pred_join_amt_30d(provided_data):
    # YOUR CODE HERE

        
    data = email_join_ols(provided_data)
    Y = data['amt_diff_30d']
    X = data['predicted_join']
    X = sm.add_constant(X)
    model = sm.OLS(Y,X)
    model_30d = model.fit()
    
    
    
    return model_30d

Your function should return the fitted model. Print its summary to check your results:

In [41]:
print(pred_join_amt_30d(data).summary())

                            OLS Regression Results                            
Dep. Variable:           amt_diff_30d   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     2.112
Date:                Sun, 21 Nov 2021   Prob (F-statistic):              0.146
Time:                        21:05:12   Log-Likelihood:            -3.8856e+05
No. Observations:               64800   AIC:                         7.771e+05
Df Residuals:                   64798   BIC:                         7.771e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const             -7.0699      1.957     -3.

In [42]:
"""checking your t-value is correct"""
# Hidden tests

'checking your t-value is correct'

## Part B (10 points)

Now we have estimated the effect of joining a team on lending using instrumental variables! However, there is a more direct way to complete these two stages.

Using the IV2SLS (Instrumental Variables 2-Stage Least Squares) function ([Documentation](https://bashtage.github.io/linearmodels/iv/iv/linearmodels.iv.model.IV2SLS.html)) in the linearmodels library will achieve everything we did above in a single step, and it will estimate the standard errors correctly, which the manual second stage does not.
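The coefficients from the manual procedure are fine; the standard errors are the problem. A second-stage OLS computes its residuals from the predicted regressor, whereas 2SLS computes them from the regressor actually observed. The sketch below illustrates the difference on synthetic data using NumPy only. The data-generating process is an assumption, and dividing the residual sum of squares by n for the 2SLS variance is an assumption about the "unadjusted" convention rather than a statement about the linearmodels internals.

```python
import numpy as np

# Synthetic data with an unobserved confounder u; all parameters are made up.
rng = np.random.default_rng(2)
n = 20_000
z = rng.integers(0, 2, size=n).astype(float)                      # instrument
u = rng.normal(size=n)                                            # unobserved confounder
d = (0.5 * z + 0.5 * u + rng.normal(size=n) > 0.8).astype(float)  # endogenous take-up
y = 2.0 * d + u + rng.normal(size=n)                              # assumed true effect is 2

const = np.ones(n)
Z = np.column_stack([const, z])   # instruments, including the constant
X = np.column_stack([const, d])   # regressors as actually observed

# Stage 1 projects every column of X onto Z; stage 2 regresses y on the projection.
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]

xtx_inv_diag = np.diag(np.linalg.inv(X_hat.T @ X_hat))

# Naive standard errors: residuals use the PREDICTED regressor (manual second-stage OLS).
resid_naive = y - X_hat @ beta
sigma2_naive = resid_naive @ resid_naive / (n - 2)
se_naive = np.sqrt(sigma2_naive * xtx_inv_diag)

# 2SLS standard errors: residuals use the OBSERVED regressor.
resid_2sls = y - X @ beta
sigma2_2sls = resid_2sls @ resid_2sls / n
se_2sls = np.sqrt(sigma2_2sls * xtx_inv_diag)

print("coefficients (identical either way):", beta)
print("naive second-stage standard errors: ", se_naive)
print("2SLS standard errors:               ", se_2sls)
```

In this simulation the naive standard errors come out larger, but the gap can go in either direction depending on how the error term relates to the part of the regressor that the instrument does not explain. Either way, only the 2SLS version is valid for inference on the effect of the observed regressor.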

1. First, let's make things simpler for ourselves. For this analysis, we will only need a dataframe with the following columns: ```email```, ```join_any```, ```amt_diff_1d```, ```amt_diff_7d```, ```amt_diff_30d```, and the ```const``` column you created above. (As in our other models, we need a constant included in the models we will be creating with IV2SLS.) (1 point)

In [43]:
# Select only the columns needed for the IV models; .copy() makes iv_dataframe independent of data
iv_dataframe = data[['email', 'join_any', 'amt_diff_1d', 'amt_diff_7d', 'amt_diff_30d', 'const']].copy()


```iv_dataframe``` should now contain the six columns listed above, with the same rows and values as in ```data```. Check that it does:

In [44]:
iv_dataframe.head()

Unnamed: 0,email,join_any,amt_diff_1d,amt_diff_7d,amt_diff_30d,const
0,1,0,0.0,0.0,-50.0,1
1,1,0,0.0,0.0,0.0,1
2,1,0,0.0,0.0,0.0,1
3,1,0,0.0,0.0,0.0,1
4,1,0,0.0,0.0,0.0,1


In [45]:
"""checking your dataframe columns are all present"""
assert all(col in iv_dataframe for col in ['email', 'join_any', 'amt_diff_1d', 'amt_diff_7d', 'amt_diff_30d', 'const'])

After looking over the documentation of the IV2SLS function, create and fit three models (one for each of the three lending measurement periods used in step 4 of Part A) that estimate the effect of joining a team on lending amounts, using the e-mail about joining a team as the instrument.

According to the [linearmodels IV2SLS documentation](https://bashtage.github.io/linearmodels/iv/iv/linearmodels.iv.model.IV2SLS.html), you will need to provide a given set of parameters to the function in order to create the model. The dependent variables and instruments are straightforward, but what are exogenous and endogenous regressors?

An exogenous regressor does not co-vary with the model’s random error while an endogenous regressor does. In our model’s case, we know the ```join_any``` variable co-varies with the random error while ```const``` cannot (since it is a constant!).
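To see why the distinction matters, consider a small simulation. It is a sketch under stated assumptions, not the Kiva data: `u` is a hypothetical unobserved factor that drives both the regressor `d` and the outcome `y`, the regressor is continuous for simplicity, and the true effect is set to 1.0. Because `u` sits in the error term and also moves `d`, ordinary least squares is biased, while the instrument `z`, which is unrelated to `u`, recovers a value close to the true effect.

```python
import numpy as np

# Synthetic data; every parameter below is an assumption made for illustration.
rng = np.random.default_rng(3)
n = 50_000
z = rng.integers(0, 2, size=n).astype(float)   # instrument, assigned at random
u = rng.normal(size=n)                         # unobserved factor, part of the error term
d = 0.5 * z + u + rng.normal(size=n)           # endogenous regressor: it depends on u
y = 1.0 * d + 2.0 * u + rng.normal(size=n)     # assumed true effect of d is 1.0

# OLS slope of y on d (with a constant): cov(d, y) / var(d).
ols_slope = np.cov(d, y)[0, 1] / np.var(d, ddof=1)

# IV slope with a single instrument: cov(z, y) / cov(z, d).
iv_slope = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]

print("OLS slope (biased upward by u):", ols_slope)
print("IV slope (close to 1.0):       ", iv_slope)
```

The constant never appears in these covariance formulas as a source of bias: it cannot co-vary with anything, which is why it is passed to IV2SLS as the exogenous regressor.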

2. Create and fit the model for the first lending measurement period, the period referred to in ```amt_diff_1d``` (3 points)

In [46]:
def iv_model_1d(provided_data):
    
    """ Take some time to think about what exactly you're modeling here, then read the linearmodels documentation.
    What is the instrument? What is the dependent variable, what are the endogenous and exogenous regressors?
    Tip: the covariance should be unadjusted in this model, (and your following models)
    """
    # YOUR CODE HERE
    iv_result_1d = IV2SLS(provided_data['amt_diff_1d'], provided_data[['const']], provided_data[['join_any']], provided_data['email'], weights=None).fit(cov_type='unadjusted')
   
   

    return iv_result_1d

Your function should return a summary view of your results. Check that it does:

In [47]:
iv_model_1d(iv_dataframe)

0,1,2,3
Dep. Variable:,amt_diff_1d,R-squared:,-2.3237
Estimator:,IV-2SLS,Adj. R-squared:,-2.3237
No. Observations:,64800,F-statistic:,17.060
Date:,"Sun, Nov 21 2021",P-value (F-stat),0.0000
Time:,23:52:58,Distribution:,chi2(1)
Cov. Estimator:,unadjusted,,
,,,

0,1,2,3,4,5,6
,Parameter,Std. Err.,T-stat,P-value,Lower CI,Upper CI
const,-2.6593,0.6699,-3.9698,0.0001,-3.9723,-1.3464
join_any,298.56,72.283,4.1304,0.0000,156.89,440.23


In [48]:
"""checking your 1d model has an unadjusted covariance and your p-value is correct"""
# Hidden tests

'checking your 1d model has an unadjusted covariance and your p-value is correct'

3. Create and fit the model for the second lending measurement period, the period referred to in ```amt_diff_7d``` (3 points)

In [49]:
def iv_model_7d(provided_data):
    # YOUR CODE HERE

    iv_result_7d = IV2SLS(provided_data['amt_diff_7d'], provided_data[['const']], provided_data[['join_any']], provided_data['email'], weights=None).fit(cov_type='unadjusted')
    
    
    
    return iv_result_7d

Your function should return a summary view of your results. Check that it does:

In [50]:
iv_model_7d(iv_dataframe)

0,1,2,3
Dep. Variable:,amt_diff_7d,R-squared:,-0.4152
Estimator:,IV-2SLS,Adj. R-squared:,-0.4153
No. Observations:,64800,F-statistic:,7.0505
Date:,"Sun, Nov 21 2021",P-value (F-stat),0.0079
Time:,23:56:12,Distribution:,chi2(1)
Cov. Estimator:,unadjusted,,
,,,

0,1,2,3,4,5,6
,Parameter,Std. Err.,T-stat,P-value,Lower CI,Upper CI
const,-6.5511,1.3661,-4.7954,0.0000,-9.2286,-3.8735
join_any,391.40,147.41,2.6553,0.0079,102.49,680.31


In [51]:
"""checking your 7d model has an unadjusted covariance and your p-value is correct"""
# Hidden tests

'checking your 7d model has an unadjusted covariance and your p-value is correct'

4. Create and fit the model for the third lending measurement period, the period referred to in ```amt_diff_30d``` (3 points)

In [52]:
def iv_model_30d(provided_data):
    # YOUR CODE HERE

    iv_result_30d = IV2SLS(provided_data['amt_diff_30d'], provided_data[['const']], provided_data[['join_any']], provided_data['email'], weights=None).fit(cov_type='unadjusted')
        
    return iv_result_30d

Your function should return a summary view of your results. Check that it does:

In [53]:
iv_model_30d(iv_dataframe)

0,1,2,3
Dep. Variable:,amt_diff_30d,R-squared:,-0.0806
Estimator:,IV-2SLS,Adj. R-squared:,-0.0807
No. Observations:,64800,F-statistic:,1.9544
Date:,"Mon, Nov 22 2021",P-value (F-stat),0.1621
Time:,00:04:21,Distribution:,chi2(1)
Cov. Estimator:,unadjusted,,
,,,

0,1,2,3,4,5,6
,Parameter,Std. Err.,T-stat,P-value,Lower CI,Upper CI
const,-7.0699,2.0347,-3.4746,0.0005,-11.058,-3.0819
join_any,306.93,219.55,1.3980,0.1621,-123.38,737.24


In [None]:
"""checking your 30d model has an unadjusted covariance and your p-value is correct"""
# Hidden tests